Abstract. This work introduces the minimax Laplace transform method, a modification of the
cumulant-based matrix Laplace transform method developed in [Tro11c] that yields both upper and
lower  bounds  on  each  eigenvalue  of  a  sum  of  random  self-adjoint  matrices.  This  machinery  is  used 
to  derive  eigenvalue  analogs  of  the  classical  Chernoff,  Bennett,  and  Bernstein  bounds. 

Two  examples  demonstrate  the  efficacy  of  the  minimax  Laplace  transform.  The  first  concerns 
the  effects  of  column  sparsification  on  the  spectrum  of  a  matrix  with  orthonormal  rows.  Here,  the 
behavior  of  the  singular  values  can  be  described  in  terms  of  coherence-like  quantities.  The  second 
example  addresses  the  question  of  relative  accuracy  in  the  estimation  of  eigenvalues  of  the  covariance 
matrix  of  a  random  process.  Standard  results  on  the  convergence  of  sample  covariance  matrices 
provide  bounds  on  the  number  of  samples  needed  to  obtain  relative  accuracy  in  the  spectral  norm, 
but  these  results  only  guarantee  relative  accuracy  in  the  estimate  of  the  maximum  eigenvalue.  The 
minimax Laplace transform argument establishes that if the lowest eigenvalues decay sufficiently
fast, $\Omega(\varepsilon^{-2} \kappa_\ell^2 \ell \log p)$ samples, where $\kappa_\ell = \lambda_1(C)/\lambda_\ell(C)$, are sufficient to ensure that the dominant
$\ell$ eigenvalues of the covariance matrix of a $\mathcal{N}(0, C)$ random vector are estimated to within a factor
of $1 \pm \varepsilon$ with high probability.


1.  Introduction 

The  field  of  nonasymptotic  random  matrix  theory  has  traditionally  focused  on  the  problem  of 
bounding  the  extreme  eigenvalues  of  a  random  matrix.  In  some  circumstances,  however,  we  may 
also  be  interested  in  studying  the  behavior  of  the  interior  eigenvalues.  In  this  case,  classical  tools 
do not readily apply. Indeed, the interior eigenvalues are determined by the min-max of a random
process,  which  is  very  challenging  to  control. 

This  paper  demonstrates  that  it  is  possible  to  combine  the  matrix  Laplace  transform  method 
detailed in [Tro11c] with the Courant-Fischer characterization of eigenvalues to obtain nontrivial
bounds on the interior eigenvalues of a sum of random self-adjoint matrices. This approach expands
the scope of the matrix probability inequalities from [Tro11c] so that they provide interesting
information  about  the  bulk  spectrum. 

As  one  application  of  our  approach,  we  investigate  estimates  for  the  covariance  matrix  of  a 
centered  stationary  random  process.  We  show  that  the  eigenvalues  of  the  sample  covariance  matrix 
provide  relative-error  approximations  to  the  eigenvalues  of  the  covariance  matrix.  We  focus  on 
Gaussian  processes,  but  our  arguments  can  be  extended  to  other  distributions.  The  following 
theorem  distills  the  results  in  section  7. 

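Theorem 1.1. Let $C \in \mathbb{R}^{p \times p}$ be positive semidefinite. Fix an integer $\ell \le p$ and assume the tail
$\{\lambda_i(C)\}_{i > \ell}$ of the spectrum of $C$ decays sufficiently fast that

Let $\{\eta_j\}_{j=1}^{n} \subset \mathbb{R}^p$ be i.i.d. samples drawn from a $\mathcal{N}(0, C)$ distribution. Define the sample covariance
matrix

\[ \hat{C}_n = \frac{1}{n} \sum_{j=1}^{n} \eta_j \eta_j^*. \]

Let $\kappa_\ell$ be the condition number associated with a dominant $\ell$-dimensional invariant subspace of $C$,

\[ \kappa_\ell = \frac{\lambda_1(C)}{\lambda_\ell(C)}. \]

If $n = \Omega(\varepsilon^{-2} \kappa_\ell^2 \ell \log p)$, then with high probability

\[ |\lambda_k(\hat{C}_n) - \lambda_k(C)| \le \varepsilon \lambda_k(C) \quad \text{for } k = 1, \ldots, \ell. \]
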
Thus, assuming sufficiently fast decay of the residual eigenvalues, $n = \Omega(\varepsilon^{-2} \kappa_\ell^2 \ell \log p)$ samples
ensure that the top $\ell$ eigenvalues of $C$ are captured to relative precision. Spectral decay of this
sort is encountered when, e.g., the residual eigenvalues of $C$ decay like for some $\delta > 0$ or
when they arise from measurements corrupted by low-power white noise.

We contrast Theorem 1.1 with established spectral norm error bounds for covariance estimation,
which do not exploit spectral decay and require that $n = \Omega(\varepsilon^{-2} \kappa_\ell^2 p)$ samples be taken to capture the
top $\ell$ eigenvalues to relative precision (see section 7). The estimate in Theorem 1.1 can be sharpened
using information about the spectrum of $C$ and the desired failure probability or modified to account
for different types of spectral decay. The same tools used in the proof of the theorem can be used
to estimate $\lambda_k(\hat{C}_n - C)$.
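The relative-error behavior described in Theorem 1.1 can be explored numerically. The following is a minimal sketch, not part of the original analysis: the function name `top_eigenvalue_relative_errors`, the diagonal test covariance, and the sample size are all illustrative assumptions, chosen so that the spectrum has three dominant eigenvalues and a small flat tail.

```python
import numpy as np

def top_eigenvalue_relative_errors(C, n, ell, rng):
    """Relative errors of the top-ell eigenvalues of the sample covariance.

    Hypothetical helper for illustration; C is a real symmetric PSD matrix.
    """
    p = C.shape[0]
    # Draw n i.i.d. N(0, C) samples; each row of `samples` is one eta_j.
    samples = rng.multivariate_normal(np.zeros(p), C, size=n)
    # Sample covariance (1/n) * sum_j eta_j eta_j^T.
    C_hat = samples.T @ samples / n
    # eigvalsh returns ascending order; reverse to get lambda_1 >= lambda_2 >= ...
    lam_true = np.linalg.eigvalsh(C)[::-1][:ell]
    lam_hat = np.linalg.eigvalsh(C_hat)[::-1][:ell]
    return np.abs(lam_hat - lam_true) / lam_true

rng = np.random.default_rng(0)
p, ell = 30, 3
# Assumed test spectrum: three dominant eigenvalues and a low-power flat tail.
spectrum = np.concatenate([[10.0, 5.0, 2.0], np.full(p - ell, 0.01)])
C = np.diag(spectrum)
errors = top_eigenvalue_relative_errors(C, n=5000, ell=ell, rng=rng)
print(errors)
```

Varying `n` in this sketch gives a rough feel for how the relative errors of the dominant eigenvalues shrink as the number of samples grows.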

1.1. Related Work. We believe that this paper contains the first general-purpose tools for studying
the full spectrum of a finite-dimensional random matrix. The literature on random matrix
theory (RMT) contains some complementary results, but they do not seem to apply with the same
generality. Methods from RMT fall into two rough categories: asymptotic methods and nonasymptotic
methods. We discuss the relevant results from each in turn.

The  modern  asymptotic  theory  began  in  the  1950s  when  physicists  observed  that,  on  certain 
scales,  the  behavior  of  a  quantum  system  is  described  by  the  spectrum  of  a  random  matrix  [Meh04]. 
They  further  observed  the  phenomenon  of  universality:  as  the  dimension  increases,  the  spectral 
statistics become independent of the distribution of the random matrix; instead, they are determined
by the symmetries of the distribution [Dei07]. Since these initial observations, physicists,
statisticians, engineers, and mathematicians have found manifold applications of the asymptotic
theory in high-dimensional statistics [Joh01, Joh07, El 08], physics [GMGW98, Meh04], wireless
communication [TV04, ST06], and pure mathematics [RS96, BK99], to mention only a few areas.

Asymptotic  random  matrix  theory  has  developed  primarily  through  the  examination  of  specific 
classes  of  random  matrices.  We  mention  two  well-studied  classes.  Sample  covariance  matrices 
take the form $n^{-1} B_n B_n^*$, where the columns of $B_n$ comprise $n$ independent observations. Wigner
matrices  are  Hermitian  matrices  whose  superdiagonal  entries  are  independent,  zero-mean,  and  have 
unit  variance  and  whose  diagonal  entries  are  i.i.d.,  real,  and  have  finite  variance. 

The  fundamental  object  of  study  in  asymptotic  random  matrix  theory  is  the  empirical  spectral 
distribution  function  (ESD).  Given  a  random  Hermitian  matrix  A  of  order  n,  its  ESD 

\[ F_A(x) = \frac{1}{n} \#\{1 \le i \le n : \lambda_i(A) \le x\} \]

is a random distribution function which encodes the statistics of the spectrum of $A$. Wigner’s
theorem [Wig55], the seminal result of the asymptotic theory, establishes that if $\{A_n\}$ is a sequence
of independent, symmetric $n \times n$ matrices with i.i.d. $\mathcal{N}(0, 1)$ entries on and above the diagonal,
then the expected ESD of $n^{-1/2} A_n$ converges weakly in probability, as $n$ approaches infinity, to the
semicircular law given by

\[ F_{sc}(x) = \frac{1}{2\pi} \int_{-\infty}^{x} \sqrt{4 - y^2}\, \mathbf{1}_{[-2,2]}(y)\, dy. \]

Thus,  at  least  in  the  limiting  sense,  the  spectra  of  these  random  matrices  are  well  characterized. 
Development  of  the  classical  asymptotic  theory  has  been  driven  by  the  natural  question  raised  by 
Wigner’s  result:  to  what  extent  is  the  semicircular  law,  and  more  generally,  the  existence  of  a 
limiting  spectral  distribution  (LSD)  universal? 
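To make the ESD and the semicircular law concrete, the sketch below builds one Gaussian Wigner matrix, computes the ESD of its normalization, and compares it with the semicircular distribution function on a grid. The helper names (`wigner_gaussian`, `esd`, `semicircle_cdf`), the dimension, and the grid are illustrative assumptions; the closed form used for the distribution function is obtained by integrating the semicircle density $\sqrt{4 - y^2}/(2\pi)$ over $[-2, x]$.

```python
import numpy as np

def wigner_gaussian(n, rng):
    """Symmetric n x n matrix with i.i.d. N(0, 1) entries on and above the diagonal."""
    G = rng.standard_normal((n, n))
    upper = np.triu(G)                  # diagonal and superdiagonal entries
    return upper + np.triu(G, 1).T      # mirror the strict upper triangle

def esd(eigenvalues, x):
    """Empirical spectral distribution: fraction of eigenvalues <= x."""
    eigenvalues = np.sort(eigenvalues)
    return np.searchsorted(eigenvalues, x, side="right") / eigenvalues.size

def semicircle_cdf(x):
    """Distribution function of the semicircular law supported on [-2, 2]."""
    x = np.clip(x, -2.0, 2.0)
    return 0.5 + x * np.sqrt(4.0 - x**2) / (4.0 * np.pi) + np.arcsin(x / 2.0) / np.pi

rng = np.random.default_rng(1)
n = 400
A = wigner_gaussian(n, rng)
eigs = np.linalg.eigvalsh(A / np.sqrt(n))    # spectrum of n^{-1/2} A_n
grid = np.linspace(-2.5, 2.5, 201)
max_gap = np.max(np.abs(esd(eigs, grid) - semicircle_cdf(grid)))
print(max_gap)
```

The quantity `max_gap` is a grid approximation to the uniform distance between the ESD and the limiting law for this single draw; it is only an empirical illustration and says nothing about rates.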

The literature on the existence and universality of LSDs is massive; we mention only the highlights.
It is now known that the semicircular law is universal for Wigner matrices. Suppose that
$\{A_n\}$ is a sequence of independent $n \times n$ Wigner matrices. Grenander established that if all the moments
are finite, then the ESD of $n^{-1/2} A_n$ converges weakly to the semicircular law in probability
[Gre63]. Arnold showed that, assuming a finite fourth moment, the ESD almost surely converges
weakly to the semicircular law [Arn71]. Around the same time, Marčenko and Pastur determined
the form of the limiting spectral distribution of sample covariance matrices [MP67].

More recently, Tao and Vu confirmed the long-conjectured circular law hypothesis. Let $\{C_n\}$ be
a sequence of independent $n \times n$ matrices whose entries are i.i.d. and have unit variance. Then the
ESD of $n^{-1/2} C_n$ converges weakly to the uniform measure on the unit disk, both in probability
and almost surely [TV10b].

Although the convergence rate of the ESD has considerable practical interest, it was not until 1993
that theoretical results became available when Bai showed that for Wigner matrices [Bai93a] and
sample covariance matrices [Bai93b] the expected ESDs of $n^{-1/2} A_n$ and $n^{-1} B_n B_n^*$, respectively,
both converge pointwise at a rate of $O(n^{-1/4})$. Later, Bai and coauthors established the pointwise
convergence in probability of the ESD of the normalized Wigner matrix $n^{-1/2} A_n$ [BMT97] and
greatly improved the convergence rates [BMT99, BMT02, BMY03]. The strongest result to date
is due to Bai et al., who have shown that, if the entries of the Wigner matrix possess finite sixth
moments, then pointwise convergence in probability of the ESD of $n^{-1/2} A_n$ occurs at the rate of
$O(n^{-1/2})$ [BHPZ11].

Classically, individual eigenvalues have been studied through the limiting behavior of the extremal
eigenvalues and the asymptotic joint distribution of several eigenvalues. Much is known
about the limiting distribution of the largest eigenvalues of Wigner and covariance matrices. Geman
showed that if the columns of $B_n$ are drawn from a sufficiently regular distribution, then the
largest eigenvalue of the sample covariance matrix $n^{-1} B_n B_n^*$ converges almost surely to a limit
[Gem80]. Bai, Yin, and coauthors showed that the existence of a fourth moment is both necessary
and sufficient for the existence of such a limit [YBK88, BSY88]. They also identified necessary and
sufficient conditions for the existence of limits for the smallest and largest eigenvalues of a normalized
Wigner matrix $n^{-1/2} A_n$ [BY88b]. El Karoui has recently described the limiting behavior of
the leading eigenvalues of a large class of sample covariance matrices [El 07].

Less  is  known  about  the  rate  of  convergence  of  the  eigenvalues,  but  some  results  are  available. 
Write the eigenvalues of a self-adjoint matrix $A$ in nonincreasing order $\lambda_1 \ge \cdots \ge \lambda_n$. For $1 \le j \le n$,
the classical location $\gamma_j$ of the $j$th eigenvalue of the normalized Wigner matrix $n^{-1/2} A_n$ is defined
via the relation

\[ \int_{-\infty}^{\gamma_j} \rho_{sc}(x)\, dx = \frac{j}{n}, \]

where $\rho_{sc}$ is the density associated with the semicircular law. Intuitively, the facts that $F_{n^{-1/2} A_n} \to F_{sc}$
and $F_{n^{-1/2} A_n}(n^{-1/2} \lambda_j) = j/n$ suggest that $n^{-1/2} \lambda_j \to \gamma_j$. Indeed, it follows from [BY88a, BY88b] that

\[ \lambda_j = \sqrt{n}\, \gamma_j + o(\sqrt{n}) \]

asymptotically almost surely. Under the assumption that the entries exhibit uniform subgaussian
decay, Erdős, Yau, and Yin have strengthened this result by showing that, up to log factors,
the eigenvalues of $n^{-1/2} A_n$ are within $O(n^{-2/3})$ of their classical position with high probability
[EYY10]. More generally, Tao and Vu have established the universality of a result due to Gustavsson
[Gus05] in the complex Gaussian Wigner case: $(\log n)^{-1/2} (\sqrt{n}\, \lambda_j - n \gamma_j)$ is asymptotically normally
distributed [TV11]. Further, they have shown that eigenvalues in the bulk of the spectrum ($j = \Omega(n)$)
of a Wigner matrix satisfy

\[ \mathbb{E}\, |\lambda_j - \sqrt{n}\, \gamma_j|^2 = O(n^{-c}), \]

for some universal constant $c > 0$ [TV10a].

In  contrast  to  the  asymptotic  theory,  which  remains  to  a  large  extent  driven  by  the  study  of 
particular  classes  of  random  matrices,  the  nonasymptotic  theory  has  developed  as  a  collection 
of techniques for addressing the behavior of a broad range of random matrices. The nonasymptotic
theory has its roots in geometric functional analysis in the 1970s, where random matrices
were used to investigate the local properties of Banach spaces [LM93, SD01, Ver10]. Since then,
the nonasymptotic theory has found applications in areas including theoretical computer science
[Ach03, Vem04, SS08], machine learning [DM05], optimization [Nem07, So09], and numerical linear
algebra [DM10, HMT11, Mah11].

As  is  the  case  in  the  asymptotic  theory,  the  sharpest  and  most  comprehensive  results  available 
in  the  nonasymptotic  theory  concern  the  behavior  of  Gaussian  matrices.  The  amenability  of  the 
Gaussian  distribution  makes  it  possible  to  obtain  results  such  as  Szarek’s  nonasymptotic  analog  of 
the  Wigner  semicircle  theorem  for  Gaussian  matrices  [Sza90]  and  Chen  and  Dongarra’s  bounds  on 
the  condition  number  of  Gaussian  matrices  [CD05].  The  properties  of  less  well-behaved  random 
matrices  can  sometimes  be  related  back  to  those  of  Gaussian  matrices  using  probabilistic  tools,  such 
as symmetrization; see, e.g., the derivation of Latała’s bound on the norms of zero-mean random
matrices  [Lat05]. 

More  generally,  bounds  on  extremal  eigenvalues  can  be  obtained  from  knowledge  of  the  moments 
of  the  entries.  For  example,  the  smallest  singular  value  of  a  square  matrix  with  i.i.d.  zero-mean 
subgaussian entries with unit variance is $O(n^{-1/2})$ with high probability [RV08]. Concentration of
measure  results,  such  as  Talagrand’s  concentration  inequality  for  product  spaces  [Tal95],  have  also 
contributed  greatly  to  the  nonasymptotic  theory.  We  mention  in  particular  the  work  of  Achlioptas 
and  McSherry  on  randomized  sparsification  of  matrices  [AM01,  AM07],  that  of  Meckes  on  the  norms 
of random matrices [Mec04], and that of Alon, Krivelevich, and Vu [AKV02] on the concentration of
the  largest  eigenvalues  of  random  symmetric  matrices,  all  of  which  are  applications  of  Talagrand’s 
inequality.  In  cases  where  geometric  information  on  the  distribution  of  the  random  matrices  is 
available,  the  tools  of  empirical  process  theory — such  as  the  generic  chaining,  also  due  to  Talagrand 
[Tal05] — can  be  used  to  convert  this  geometric  information  into  information  on  the  spectra.  One 
natural  example  of  such  a  case  consists  of  matrices  whose  rows  are  independently  drawn  from  a 
log-concave distribution [MP06, ALPTJ11].

The  noncommutative  Khintchine  inequality  (NCKI),  which  bounds  the  moments  of  the  norm  of 
a  sum  of  fixed  matrices  modulated  by  random  signs  [LP86,  LPP91],  is  a  widely  used  tool  in  the 
nonasymptotic  theory.  Despite  its  power,  the  NCKI  is  unwieldy.  To  use  it,  one  must  reduce  the 
problem  to  a  suitable  form  by  applying  symmetrization  and  decoupling  arguments  and  exploiting 
the  equivalence  between  moments  and  tail  bounds.  It  is  often  more  convenient  to  apply  the  NCKI  in 
the  guise  of  a  lemma,  due  to  Rudelson  [Rud99],  that  provides  an  analog  of  the  law  of  large  numbers 
for  sums  of  rank-one  matrices.  This  result  has  found  many  applications,  including  column-subset 
selection  [RV07]  and  the  fast  approximate  solution  of  least-squares  problems  [DMMS11].  The  NCKI 
and  its  corollaries  do  not  always  yield  sharp  results  because  parasitic  logarithmic  factors  arise  in 
many settings.

The current paper is ultimately based on the influential work of Ahlswede and Winter [AW02].
This  line  of  research  leads  to  explicit  tail  bounds  for  the  maximum  eigenvalue  of  a  sum  of  random 
matrices.  These  probability  inequalities  parallel  the  classical  scalar  tail  bounds  due  to  Bernstein  and 
others.  Matrix  probability  inequalities  allow  us  to  obtain  valuable  information  about  the  maximum 
eigenvalue  of  a  random  matrix  with  very  little  effort.  Furthermore,  they  apply  to  a  wide  variety 
of  random  matrices.  We  note,  however,  that  matrix  probability  inequalities  can  lead  to  parasitic 
logarithmic  factors  similar  to  those  that  emerge  from  the  NCKI. 

Major contributions to the literature on matrix probability inequalities include the papers [CM08,
Rec09, Gro11]. We emphasize two works of Oliveira [Oli09, Oli10] that go well beyond earlier
research. The sharpest current results appear in the works of Tropp [Tro11c, Tro11b, Tro11a].
Recently,  Hsu,  Kakade,  and  Zhang  [HKZ11]  have  modified  Tropp’s  approach  to  establish  matrix 
probability  inequalities  that  depend  on  an  intrinsic  dimension  parameter,  rather  than  the  ambient 
dimension. 

1.2. Outline. In section 2, we introduce the notation used in this paper and state a convenient
version of the Courant-Fischer theorem. In section 3, we use the Courant-Fischer theorem to
extend the Laplace transform technique from [Tro11c] to apply to all the eigenvalues of self-adjoint
matrices, thereby obtaining the minimax Laplace transform. We apply this technique in sections
4 and 5 to develop eigenvalue analogs of the classical Chernoff and Bernstein bounds. The final
two sections illustrate, using two familiar problems, that the minimax Laplace technique gives us
significantly more information on the spectra of random matrices than current approaches. In
section 6, we use the Chernoff bounds to quantify the effects of column sparsification on all the
singular values of matrices with orthogonal rows. In section 7, we consider the question of how
fast, in relative error, the eigenvalues of empirical covariance matrices converge.


2.  Background  and  Notation 

We  establish  the  notation  used  in  the  sequel  and  state  a  convenient  version  of  the  Courant-Fischer 
theorem. 

Unless otherwise stated, we work over the complex field. The $k$th column of the matrix $A$ is
denoted by $a_k$, and the entries are denoted $a_{jk}$ or $(A)_{jk}$. We define $\mathbb{M}^n_{sa}$ to be the set of self-adjoint
matrices with dimension $n$. The eigenvalues of a matrix $A$ in $\mathbb{M}^n_{sa}$ are arranged in weakly
decreasing order: $\lambda_{\max}(A) = \lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_n(A) = \lambda_{\min}(A)$. Likewise, singular values
of a rectangular matrix $B$ with rank $r$ are ordered $s_1(B) \ge s_2(B) \ge \cdots \ge s_r(B)$. The spectral norm
of a matrix $B$ is expressed as $\|B\|$. We often compare self-adjoint matrices using the semidefinite
ordering. In this ordering, $A$ is greater than or equal to $B$, written $A \succeq B$ or $B \preceq A$, when $A - B$
is positive semidefinite.

The  expectation  of  a  random  variable  is  denoted  by  EX.  We  write  X  ~  Bern(p)  to  indicate  that 
X  has  a  Bernoulli  distribution  with  mean  p. 

One  of  our  central  tools  is  the  variational  characterization  of  the  eigenvalues  of  a  self-adjoint 
matrix given by the Courant-Fischer theorem. For integers $d$ and $n$ satisfying $1 \le d \le n$, the
complex Stiefel manifold

\[ \mathbb{V}^n_d = \{V \in \mathbb{C}^{n \times d} : V^*V = I\} \]

is the collection of orthonormal bases for the $d$-dimensional subspaces of $\mathbb{C}^n$, or, equivalently, the
collection of all isometric embeddings of $\mathbb{C}^d$ into $\mathbb{C}^n$. Let $A$ be a self-adjoint matrix with dimension
$n$, and let $V \in \mathbb{V}^n_d$ be an orthonormal basis for a subspace of $\mathbb{C}^n$. Then the matrix $V^*AV$ can be
interpreted as the compression of $A$ to the space spanned by $V$.

Proposition 2.1 (Courant-Fischer). Let $A$ be a self-adjoint matrix with dimension $n$. Then

\[ \lambda_k(A) = \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(V^*AV) \quad \text{and} \tag{2.1} \]

\[ \lambda_k(A) = \max_{V \in \mathbb{V}^n_k} \lambda_{\min}(V^*AV). \tag{2.2} \]

A matrix $V_- \in \mathbb{V}^n_k$ achieves equality in (2.2) if and only if its columns span a dominant
$k$-dimensional invariant subspace of $A$. Likewise, a matrix $V_+ \in \mathbb{V}^n_{n-k+1}$ achieves equality in (2.1) if
and only if its columns span a bottom $(n - k + 1)$-dimensional invariant subspace of $A$.

The $\pm$ subscripts in Proposition 2.1 are chosen to reflect the fact that $\lambda_k(A)$ is the minimum
eigenvalue of $V_-^* A V_-$ and the maximum eigenvalue of $V_+^* A V_+$. As a consequence of Proposition
2.1, when $A$ is self-adjoint, $\lambda_k(-A) = -\lambda_{n-k+1}(A)$. This fact allows us to use the same techniques
we develop for bounding the eigenvalues from above to bound them from below.
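The statements of Proposition 2.1 are easy to check numerically on a small example. The sketch below is an illustration under stated assumptions: it uses a random real symmetric matrix (a special case of a self-adjoint matrix), takes the optimal embeddings from an eigendecomposition, and uses the QR factorization of a Gaussian matrix as one arbitrary point of the Stiefel manifold. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3                               # k is 1-indexed, as in the text

# Random real symmetric (hence self-adjoint) test matrix.
G = rng.standard_normal((n, n))
A = (G + G.T) / 2

# eigh returns ascending eigenvalues; flip to the weakly decreasing convention.
w, U = np.linalg.eigh(A)
lam = w[::-1]
U = U[:, ::-1]

V_minus = U[:, :k]        # basis of a dominant k-dimensional invariant subspace
V_plus = U[:, k - 1:]     # basis of a bottom (n - k + 1)-dimensional invariant subspace

# Equality cases of (2.2) and (2.1), respectively.
lam_k_from_max = np.linalg.eigvalsh(V_minus.T @ A @ V_minus).min()
lam_k_from_min = np.linalg.eigvalsh(V_plus.T @ A @ V_plus).max()

# An arbitrary isometric embedding can only give a smaller value in (2.2).
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
generic_value = np.linalg.eigvalsh(Q.T @ A @ Q).min()

# Reflection identity lambda_k(-A) = -lambda_{n-k+1}(A).
lam_neg = np.linalg.eigvalsh(-A)[::-1]

print(lam[k - 1], lam_k_from_max, lam_k_from_min, generic_value)
```

The first three printed numbers should agree up to rounding, while the fourth is at most $\lambda_k(A)$, which reflects the max in (2.2).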

3.  Tail  Bounds  For  Interior  Eigenvalues 

In  this  section  we  develop  a  generic  bound  on  the  tail  probabilities  of  eigenvalues  of  sums  of 
independent,  random,  self-adjoint  matrices.  We  establish  this  bound  by  supplementing  the  matrix 
Laplace transform methodology of [Tro11c] with Proposition 2.1 and a new result, due to Lieb
and  Seiringer  [LS05],  on  the  concavity  of  a  certain  trace  function  on  the  cone  of  positive-definite 
matrices. 

First we observe that the Courant-Fischer theorem allows us to relate the behavior of the $k$th
eigenvalue of a matrix to the behavior of the largest eigenvalue of an appropriate compression of
the matrix.

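Theorem 3.1. Let $X$ be a random self-adjoint matrix with dimension $n$, and let $k \le n$ be an
integer. Then, for all $t \in \mathbb{R}$,

\[ \mathbb{P}\{\lambda_k(X) \ge t\} \le \inf_{\theta > 0} \min_{V \in \mathbb{V}^n_{n-k+1}} \Bigl\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta V^*XV} \Bigr\}. \tag{3.1} \]
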

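Proof. Let $\theta$ be a fixed positive number. Then

\begin{align*}
\mathbb{P}\{\lambda_k(X) \ge t\} &= \mathbb{P}\{\lambda_k(\theta X) \ge \theta t\} = \mathbb{P}\{e^{\lambda_k(\theta X)} \ge e^{\theta t}\} \\
&\le e^{-\theta t} \cdot \mathbb{E} e^{\lambda_k(\theta X)} = e^{-\theta t} \cdot \mathbb{E} \exp\Bigl\{ \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\theta V^*XV) \Bigr\}.
\end{align*}

The first identity follows from the positive homogeneity of eigenvalue maps and the second from
the monotonicity of the scalar exponential function. The final two relations are Markov’s inequality
and (2.1).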

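To continue, we need to bound the expectation. Interchange the order of the exponential and
the minimum; then apply the spectral mapping theorem to see that

\begin{align*}
\mathbb{E} \exp\Bigl\{ \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\theta V^*XV) \Bigr\}
&= \mathbb{E} \min_{V \in \mathbb{V}^n_{n-k+1}} \lambda_{\max}(\exp(\theta V^*XV)) \\
&\le \min_{V \in \mathbb{V}^n_{n-k+1}} \mathbb{E} \lambda_{\max}(\exp(\theta V^*XV)) \\
&\le \min_{V \in \mathbb{V}^n_{n-k+1}} \mathbb{E} \operatorname{tr} \exp(\theta V^*XV).
\end{align*}

The first inequality is Jensen’s. The second inequality follows because the exponential of a self-adjoint
matrix is positive definite, so its largest eigenvalue is smaller than its trace.

Combine these observations and take the infimum over all positive $\theta$ to complete the argument.

□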
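Two facts carry the last part of this proof: the spectral mapping identity $\lambda_{\max}(e^{M}) = e^{\lambda_{\max}(M)}$ and the bound of the largest eigenvalue of a positive-definite matrix by its trace. The following sketch checks both on a random symmetric matrix that stands in for the compression $\theta V^*XV$; the helper `expm_symmetric` is a hypothetical name, and it computes the matrix exponential from an eigendecomposition rather than from a library routine.

```python
import numpy as np

def expm_symmetric(M):
    """Matrix exponential of a real symmetric matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * np.exp(w)) @ U.T      # U diag(exp(w)) U^T

rng = np.random.default_rng(3)
d = 5
G = rng.standard_normal((d, d))
M = (G + G.T) / 2                     # stands in for theta * V^* X V

E = expm_symmetric(M)
lam_max_of_exp = np.linalg.eigvalsh(E).max()
exp_of_lam_max = np.exp(np.linalg.eigvalsh(M).max())
trace_of_exp = np.trace(E)

print(lam_max_of_exp, exp_of_lam_max, trace_of_exp)
```

The trace can exceed the largest eigenvalue by at most a factor of the dimension `d`, which is the source of the dimensional factors in the tail bounds that follow.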
We are interested in the case where the matrix $X$ in Theorem 3.1 can be expressed as a sum of
independent random matrices. In this case, we use the following result to develop the right-hand
side of the Laplace transform bound (3.1).


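Theorem 3.2. Consider a finite sequence $\{X_j\}$ of independent, random, self-adjoint matrices with
dimension $n$ and a sequence $\{A_j\}$ of fixed self-adjoint matrices with dimension $n$ that satisfy the
relations

\[ \mathbb{E} e^{X_j} \preceq e^{A_j}. \tag{3.2} \]

Let $V \in \mathbb{V}^n_k$ be an isometric embedding of $\mathbb{C}^k$ into $\mathbb{C}^n$ for some $k \le n$. Then

\[ \mathbb{E} \operatorname{tr} \exp\Bigl\{ \sum_j V^*X_jV \Bigr\} \le \operatorname{tr} \exp\Bigl\{ \sum_j V^*A_jV \Bigr\}. \tag{3.3} \]

In particular,

\[ \mathbb{E} \operatorname{tr} \exp\Bigl\{ \sum_j X_j \Bigr\} \le \operatorname{tr} \exp\Bigl\{ \sum_j A_j \Bigr\}. \tag{3.4} \]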

Theorem 3.2 is an extension of Lemma 3.4 of [Tro11c], which establishes the special case (3.4).
The  proof  depends  upon  a  recent  result  due  to  Lieb  and  Seiringer  [LS05,  Thm.  3]  that  extends 
Lieb’s  earlier  result  [Lie73,  Thm.  6]. 

Proposition 3.1 (Lieb-Seiringer 2005). Let $H$ be a self-adjoint matrix with dimension $k$. Let
$V \in \mathbb{V}^n_k$ be an isometric embedding of $\mathbb{C}^k$ into $\mathbb{C}^n$ for some $k \le n$. Then the function

\[ A \mapsto \operatorname{tr} \exp\{H + V^*(\log A)V\} \]

is concave on the cone of positive-definite matrices in $\mathbb{M}^n_{sa}$.

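Proof of Theorem 3.2. First, note that (3.2) and the operator monotonicity of the matrix logarithm
yield the following inequality for each $k$:

\[ \log \mathbb{E} e^{X_k} \preceq A_k. \tag{3.5} \]

Let $\mathbb{E}_k$ denote expectation conditioned on the first $k$ summands, $X_1$ through $X_k$, and let $\ell$ denote
the number of summands. Then

\begin{align*}
\mathbb{E} \operatorname{tr} \exp\Bigl\{ \sum_{j \le \ell} V^*X_jV \Bigr\}
&= \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{\ell-1} \operatorname{tr} \exp\Bigl\{ \sum_{j \le \ell-1} V^*X_jV + V^*(\log e^{X_\ell})V \Bigr\} \\
&\le \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Bigl\{ \sum_{j \le \ell-1} V^*X_jV + V^*(\log \mathbb{E} e^{X_\ell})V \Bigr\} \\
&\le \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Bigl\{ \sum_{j \le \ell-1} V^*X_jV + V^*(\log e^{A_\ell})V \Bigr\} \\
&= \mathbb{E} \mathbb{E}_1 \cdots \mathbb{E}_{\ell-2} \operatorname{tr} \exp\Bigl\{ \sum_{j \le \ell-1} V^*X_jV + V^*A_\ell V \Bigr\}.
\end{align*}

The first inequality follows from Proposition 3.1 and Jensen’s inequality, and the second depends
on (3.5) and the monotonicity of the trace exponential. Iterate this argument to complete the
proof. □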
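Inequality (3.3) can be verified exactly on a small instance. The sketch below assumes a matrix Rademacher series, $X_j = \varepsilon_j B_j$ with independent random signs $\varepsilon_j$ and fixed symmetric $B_j$, which is an illustrative choice rather than the setting of any particular result here. In that case $\mathbb{E} e^{X_j} = \cosh(B_j) \preceq e^{B_j^2/2}$, so $A_j = B_j^2/2$ satisfies (3.2), and the expectation on the left of (3.3) is a finite average over all sign patterns. All names in the code are hypothetical.

```python
import itertools
import numpy as np

def expm_symmetric(M):
    """Matrix exponential of a real symmetric matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * np.exp(w)) @ U.T

rng = np.random.default_rng(4)
n, k, m = 5, 3, 4                     # ambient dimension, compressed dimension, summands

# Fixed symmetric coefficient matrices B_j (assumed example data).
B = []
for _ in range(m):
    G = rng.standard_normal((n, n))
    B.append((G + G.T) / 4)

# One isometric embedding V of R^k into R^n.
V, _ = np.linalg.qr(rng.standard_normal((n, k)))

# Left-hand side of (3.3): exact expectation over all 2^m sign patterns.
lhs = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=m):
    S = sum(s * Bj for s, Bj in zip(signs, B))
    lhs += np.trace(expm_symmetric(V.T @ S @ V))
lhs /= 2 ** m

# Right-hand side of (3.3) with A_j = B_j^2 / 2.
A_sum = sum(Bj @ Bj / 2 for Bj in B)
rhs = np.trace(expm_symmetric(V.T @ A_sum @ V))

print(lhs, rhs)
```

Because the expectation is computed by enumeration, the comparison involves no sampling error; only the choice of the $B_j$ and of $V$ is random.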

Our  main  result  follows  from  combining  Theorem  3.1  and  Theorem  3.2. 

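Theorem 3.3 (Minimax Laplace Transform). Consider a finite sequence $\{X_j\}$ of independent,
random, self-adjoint matrices with dimension $n$, and let $k \le n$ be an integer.

(i) Let $\{A_j\}$ be a sequence of self-adjoint matrices that satisfy the semidefinite relations

\[ \mathbb{E} e^{\theta X_j} \preceq e^{g(\theta) A_j} \]

where $g : (0, \infty) \to [0, \infty)$. Then, for all $t \in \mathbb{R}$,

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \ge t \Bigr\} \le \inf_{\theta > 0} \min_{V \in \mathbb{V}^n_{n-k+1}} \Bigl[ e^{-\theta t} \cdot \operatorname{tr} \exp\Bigl\{ g(\theta) \sum_j V^*A_jV \Bigr\} \Bigr]. \]

(ii) Let $\{A_j : \mathbb{V}^n_{n-k+1} \to \mathbb{M}^n_{sa}\}$ be a sequence of functions that satisfy the semidefinite relations

\[ \mathbb{E} e^{\theta V^*X_jV} \preceq e^{g(\theta) A_j(V)} \]

for all $V \in \mathbb{V}^n_{n-k+1}$, where $g : (0, \infty) \to [0, \infty)$. Then, for all $t \in \mathbb{R}$,

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \ge t \Bigr\} \le \inf_{\theta > 0} \min_{V \in \mathbb{V}^n_{n-k+1}} \Bigl[ e^{-\theta t} \cdot \operatorname{tr} \exp\Bigl\{ g(\theta) \sum_j A_j(V) \Bigr\} \Bigr]. \]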

The  first  bound  in  Theorem  3.3  requires  less  detailed  information  on  how  compression  affects 
the  summands  but  correspondingly  does  not  give  as  sharp  results  as  the  second. 

In  the  following  two  sections,  we  use  the  minimax  Laplace  transform  method  to  derive  Chernoff 
and  Bernstein  inequalities  for  the  interior  eigenvalues  of  a  sum  of  independent  random  matrices. 
Tail  bounds  for  the  eigenvalues  of  matrix  Rademacher  and  Gaussian  series,  eigenvalue  Hoeffding, 
and matrix martingale eigenvalue tail bounds can all be derived in a similar manner; see [Tro11c]
for  relevant  details. 

4.  Chernoff  bounds 

Classical  Chernoff  bounds  establish  that  the  tails  of  a  sum  of  independent  nonnegative  random 
variables decay subexponentially. [Tro11c] develops Chernoff bounds for the maximum and minimum
eigenvalues of a sum of independent positive-semidefinite matrices. We extend this analysis
to  study  the  interior  eigenvalues. 

Intuitively,  the  eigenvalue  tail  bounds  should  depend  on  how  concentrated  the  summands  are; 
e.g.,  the  maximum  eigenvalue  of  a  sum  of  operators  whose  ranges  are  aligned  is  likely  to  vary 
more  than  that  of  a  sum  of  operators  whose  ranges  are  orthogonal.  To  measure  how  much  a 
finite sequence of random summands $\{X_j\}$ concentrates in a given subspace, we define a function
$\Psi : \bigcup_{1 \le k \le n} \mathbb{V}^n_k \to \mathbb{R}$ that satisfies

\[ \max_j \lambda_{\max}(V^*X_jV) \le \Psi(V) \quad \text{almost surely for each } V \in \bigcup_{1 \le k \le n} \mathbb{V}^n_k. \tag{4.1} \]

The sequence $\{X_j\}$ associated with $\Psi$ will always be clear from context. We have the following
result. 

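Theorem 4.1 (Eigenvalue Chernoff Bounds). Consider a finite sequence $\{X_j\}$ of independent,
random, positive-semidefinite matrices with dimension $n$. Given an integer $k \le n$, define

\[ \mu_k = \lambda_k\Bigl( \sum_j \mathbb{E} X_j \Bigr), \]

and let $V_+ \in \mathbb{V}^n_{n-k+1}$ and $V_- \in \mathbb{V}^n_k$ be isometric embeddings that satisfy

\[ \mu_k = \lambda_{\max}\Bigl( \sum_j V_+^*(\mathbb{E} X_j)V_+ \Bigr) = \lambda_{\min}\Bigl( \sum_j V_-^*(\mathbb{E} X_j)V_- \Bigr). \]

Then

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \ge (1 + \delta)\mu_k \Bigr\} \le (n - k + 1) \cdot \left[ \frac{e^{\delta}}{(1 + \delta)^{1 + \delta}} \right]^{\mu_k / \Psi(V_+)} \quad \text{for } \delta > 0, \text{ and} \]

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \le (1 - \delta)\mu_k \Bigr\} \le k \cdot \left[ \frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}} \right]^{\mu_k / \Psi(V_-)} \quad \text{for } \delta \in [0, 1), \]

where $\Psi$ is a function that satisfies (4.1).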


TAIL  BOUNDS  FOR  EIGENVALUES  OF  RANDOM  MATRICES 




Theorem 4.1 tells us how the tails of the $k$th eigenvalue are controlled by the variation of the
random summands in the top and bottom invariant subspaces of $\sum_j \mathbb{E} X_j$. Up to the dimensional
factors $k$ and $n - k + 1$, the eigenvalues exhibit binomial-type tails. When $k = 1$ (respectively, $k = n$)
Theorem 4.1 controls the probability that the largest eigenvalue of the sum is small (respectively, the
probability that the smallest eigenvalue of the sum is large), thereby complementing the one-sided
Chernoff bounds of [Tro11c].
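For a sense of scale, the right-hand sides of the two bounds in Theorem 4.1 are simple to evaluate. The sketch below is only a calculator for those expressions; the function names and the sample parameter values are illustrative assumptions, and the quantities $\mu_k$, $\Psi(V_+)$, and $\Psi(V_-)$ must be supplied by the user.

```python
import math

def chernoff_upper_tail(n, k, mu_k, psi_plus, delta):
    """(n - k + 1) * [e^delta / (1 + delta)^(1 + delta)]^(mu_k / psi_plus), delta > 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    # Work with the logarithm of the bracketed base for numerical stability.
    log_base = delta - (1.0 + delta) * math.log1p(delta)
    return (n - k + 1) * math.exp(log_base * mu_k / psi_plus)

def chernoff_lower_tail(k, mu_k, psi_minus, delta):
    """k * [e^(-delta) / (1 - delta)^(1 - delta)]^(mu_k / psi_minus), 0 <= delta < 1."""
    if not 0 <= delta < 1:
        raise ValueError("delta must lie in [0, 1)")
    log_base = -delta - (1.0 - delta) * math.log1p(-delta)
    return k * math.exp(log_base * mu_k / psi_minus)

# Assumed example values: dimension 100, tenth eigenvalue, mu_k = 50, Psi = 1.
upper = chernoff_upper_tail(n=100, k=10, mu_k=50.0, psi_plus=1.0, delta=0.5)
lower = chernoff_lower_tail(k=10, mu_k=50.0, psi_minus=1.0, delta=0.5)
print(upper, lower)
```

When the ratio $\mu_k / \Psi$ is small relative to the logarithm of the dimensional factor, the returned value exceeds one and the bound is vacuous.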

Remark 4.1. If it is difficult to estimate $\Psi(V_+)$ or $\Psi(V_-)$, one can resort to the weaker estimates

\[ \Psi(V_+) \le \max_{V \in \mathbb{V}^n_{n-k+1}} \max_j \|V^*X_jV\| = \max_j \|X_j\|, \]

\[ \Psi(V_-) \le \max_{V \in \mathbb{V}^n_k} \max_j \|V^*X_jV\| = \max_j \|X_j\|. \]


Theorem  4.1  follows  from  Theorem  3.3  using  an  appropriate  bound  on  the  matrix  moment 
generating functions. The following lemma is due to Ahlswede and Winter [AW02]; see also [Tro11c,
Lem. 5.8].

Lemma 4.2. Suppose that $X$ is a random positive-semidefinite matrix that satisfies $\lambda_{\max}(X) \le 1$.
Then

\[ \mathbb{E} e^{\theta X} \preceq \exp\bigl( (e^{\theta} - 1)(\mathbb{E} X) \bigr) \quad \text{for } \theta \in \mathbb{R}. \]



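Proof of Theorem 4.1, upper bound. We consider the case where $\Psi(V_+) = 1$; the general case follows
by homogeneity. Define

\[ A_j(V_+) = V_+^*(\mathbb{E} X_j)V_+ \quad \text{and} \quad g(\theta) = e^{\theta} - 1. \]

Theorem 3.3(ii) and Lemma 4.2 imply that

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \ge (1 + \delta)\mu_k \Bigr\} \le \inf_{\theta > 0} e^{-\theta(1 + \delta)\mu_k} \cdot \operatorname{tr} \exp\Bigl\{ g(\theta) \sum_j V_+^*(\mathbb{E} X_j)V_+ \Bigr\}. \]

Bound the trace by the maximum eigenvalue, taking into account the reduced dimension of the
summands:

\begin{align*}
\operatorname{tr} \exp\Bigl\{ g(\theta) \sum_j V_+^*(\mathbb{E} X_j)V_+ \Bigr\}
&\le (n - k + 1) \cdot \lambda_{\max}\Bigl( \exp\Bigl\{ g(\theta) \sum_j V_+^*(\mathbb{E} X_j)V_+ \Bigr\} \Bigr) \\
&= (n - k + 1) \cdot \exp\Bigl\{ g(\theta) \cdot \lambda_{\max}\Bigl( \sum_j V_+^*(\mathbb{E} X_j)V_+ \Bigr) \Bigr\}.
\end{align*}

The equality follows from the spectral mapping theorem. Identify the quantity $\mu_k$; then combine
the last two inequalities to obtain

\[ \mathbb{P}\Bigl\{ \lambda_k\Bigl( \sum_j X_j \Bigr) \ge (1 + \delta)\mu_k \Bigr\} \le (n - k + 1) \cdot \inf_{\theta > 0} e^{[g(\theta) - \theta(1 + \delta)]\mu_k}. \]

The right-hand side is minimized when $\theta = \log(1 + \delta)$, which gives the desired upper tail bound. □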
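The final minimization can be spelled out in a few lines. The calculation below uses only $g(\theta) = e^{\theta} - 1$ and the normalization $\Psi(V_+) = 1$ from the proof; the exponent is convex in $\theta$, so its critical point is the minimizer, and that point is positive whenever $\delta > 0$.

```latex
\begin{align*}
\frac{d}{d\theta}\bigl[ g(\theta) - \theta(1+\delta) \bigr]
  &= e^{\theta} - (1+\delta) = 0
  \quad\Longrightarrow\quad \theta = \log(1+\delta), \\
\bigl[ g(\theta) - \theta(1+\delta) \bigr]_{\theta = \log(1+\delta)}
  &= \delta - (1+\delta)\log(1+\delta), \\
\exp\bigl\{ [\delta - (1+\delta)\log(1+\delta)]\,\mu_k \bigr\}
  &= \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_k}.
\end{align*}
```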


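Proof of Theorem 4.1, lower bound. As before, we consider the case where $\Psi(V_-) = 1$. Clearly,

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \le (1-\delta)\mu_k \Big\} = \mathbb{P}\Big\{ \lambda_{n-k+1}\Big(\sum_j -X_j\Big) \ge -(1-\delta)\mu_k \Big\}. \qquad (4.2) \]

Apply Lemma 4.2 to see that, for $\theta > 0$,

\[ \mathbb{E}\, e^{\theta(-V_-^* X_j V_-)} = \mathbb{E}\, e^{(-\theta) V_-^* X_j V_-} \preceq \exp\big( g(\theta) \cdot V_-^*(-\mathbb{E} X_j)V_- \big), \]

where $g(\theta) = 1 - e^{-\theta}$. Theorem 3.3(ii) thus implies that the latter probability in (4.2) is bounded by

\[ \inf_{\theta>0} e^{\theta(1-\delta)\mu_k} \cdot \operatorname{tr}\exp\Big\{ g(\theta) \sum_j V_-^*(-\mathbb{E} X_j)V_- \Big\}. \]

Using reasoning analogous to that in the proof of the upper bound, we justify the first of the following inequalities:

\[ \operatorname{tr}\exp\Big\{ g(\theta) \sum_j V_-^*(-\mathbb{E} X_j)V_- \Big\} \le k \cdot \exp\Big\{ \lambda_{\max}\Big( g(\theta) \sum_j V_-^*(-\mathbb{E} X_j)V_- \Big) \Big\} \]

\[ = k \cdot \exp\Big\{ -g(\theta) \cdot \lambda_{\min}\Big( \sum_j V_-^*(\mathbb{E} X_j)V_- \Big) \Big\} \]

\[ = k \cdot \exp\{ -g(\theta)\mu_k \}. \]

The remaining equalities follow from the fact that $-g(\theta) < 0$ and the definition of $\mu_k$.

This argument establishes the bound

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \le (1-\delta)\mu_k \Big\} \le k \cdot \inf_{\theta>0} e^{[\theta(1-\delta) - g(\theta)]\mu_k}. \]

The right-hand side is minimized when $\theta = -\log(1-\delta)$, which gives the desired lower tail bound. □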
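As a sanity check on the lower tail, one can compare the stated minimizer with a brute-force grid search. This is a sketch only: the grid resolution and the helper names are assumptions made for illustration.

```python
import math

def lower_exponent(theta, delta):
    """Exponent theta*(1 - delta) - g(theta) per unit of mu_k, with g(theta) = 1 - e^(-theta)."""
    return theta * (1.0 - delta) - (1.0 - math.exp(-theta))

def chernoff_lower_bound(k, mu_k, psi_minus, delta):
    """k * [e^(-delta) / (1-delta)^(1-delta)]^(mu_k / psi_minus), for delta in [0, 1)."""
    log_base = -delta - (1.0 - delta) * math.log1p(-delta)
    return k * math.exp(log_base * mu_k / psi_minus)

delta = 0.3
grid = [i / 10000.0 for i in range(1, 30001)]            # theta in (0, 3]
theta_grid = min(grid, key=lambda th: lower_exponent(th, delta))
theta_star = -math.log1p(-delta)                          # minimizer named in the proof
```

The grid minimizer agrees with `theta_star` up to the grid spacing, and the exponent at `theta_star` equals $-\delta - (1-\delta)\log(1-\delta)$.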

5.  Bennett  and  Bernstein  inequalities 

The  classical  Bennett  and  Bernstein  inequalities  use  the  variance  or  knowledge  of  the  moments  of 
the  summands  to  control  the  probability  that  a  sum  of  independent  random  variables  deviates  from 
its mean. In [Tro11c], matrix Bennett and Bernstein inequalities are developed for the extreme
eigenvalues  of  self-adjoint  random  matrix  sums.  We  establish  that  the  interior  eigenvalues  satisfy 
analogous  inequalities. 

As in the derivation of the Chernoff inequalities of section 4, we need a measure of how concentrated the random summands are in a given subspace. Recall that the function $\Psi : \bigcup_{1 \le k \le n} \mathbb{V}^n_k \to \mathbb{R}$ satisfies

\[ \max_j \lambda_{\max}(V^* X_j V) \le \Psi(V) \quad \text{almost surely for each } V \in \bigcup_{1 \le k \le n} \mathbb{V}^n_k. \qquad (5.1) \]

The sequence $\{X_j\}$ associated with $\Psi$ will always be clear from context.

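Theorem 5.1 (Eigenvalue Bennett Inequality). Consider a finite sequence $\{X_j\}$ of independent, random, self-adjoint matrices with dimension $n$, all of which have zero mean. Given an integer $k \le n$, define

\[ \sigma_k^2 = \lambda_k\Big( \sum_j \mathbb{E}(X_j^2) \Big). \]

Choose $V_+ \in \mathbb{V}^n_{n-k+1}$ to satisfy

\[ \sigma_k^2 = \lambda_{\max}\Big( \sum_j V_+^* \mathbb{E}(X_j^2) V_+ \Big). \]

Then, for all $t \ge 0$,

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \ge t \Big\} \le (n-k+1) \cdot \exp\Big\{ -\frac{\sigma_k^2}{\Psi(V_+)^2} \cdot h\Big( \frac{\Psi(V_+)\, t}{\sigma_k^2} \Big) \Big\} \qquad \text{(i)} \]

\[ \le (n-k+1) \cdot \exp\Big\{ \frac{-t^2/2}{\sigma_k^2 + \Psi(V_+)\, t/3} \Big\} \qquad \text{(ii)} \]

\[ \le \begin{cases} (n-k+1) \cdot \exp\{ -\tfrac{3}{8} t^2/\sigma_k^2 \} & \text{for } t \le \sigma_k^2/\Psi(V_+) \\ (n-k+1) \cdot \exp\{ -\tfrac{3}{8} t/\Psi(V_+) \} & \text{for } t \ge \sigma_k^2/\Psi(V_+), \end{cases} \qquad \text{(iii)} \]

where the function $h(u) = (1+u)\log(1+u) - u$ for $u \ge 0$. The function $\Psi$ satisfies (5.1) above.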

Results  (i)  and  (ii)  are,  respectively,  matrix  analogs  of  the  classical  Bennett  and  Bernstein 
inequalities.  As  in  the  scalar  case,  the  Bennett  inequality  reflects  a  Poisson-type  decay  in  the  tails 
of the eigenvalues. The Bernstein inequality states that small deviations from the eigenvalues of
the expected matrix are roughly normally distributed while larger deviations are subexponential.
The split Bernstein inequalities (iii) make explicit the division between these two regimes.

As  stated,  Theorem  5.1  estimates  the  probability  that  the  eigenvalues  of  a  sum  are  large.  Using 
the  identity 

\[ \lambda_k\Big( \sum_j X_j \Big) = -\lambda_{n-k+1}\Big( -\sum_j X_j \Big), \]

Theorem  5.1  can  be  applied  to  estimate  the  probability  that  eigenvalues  of  a  sum  are  small. 
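The index flip in this identity is easy to see on a list of eigenvalues. The sketch below works directly with a hypothetical spectrum (the numbers are made up for illustration) rather than with matrices, since negating a self-adjoint matrix negates each eigenvalue.

```python
def kth_largest(values, k):
    """Return the k-th largest entry of values (k = 1 gives the maximum)."""
    return sorted(values, reverse=True)[k - 1]

# Eigenvalues of a hypothetical self-adjoint sum, listed in no particular order.
spectrum = [3.0, -1.5, 0.25, 2.0, -4.0]
n = len(spectrum)
negated = [-x for x in spectrum]        # spectrum of the negated matrix

# lambda_k(M) = -lambda_{n-k+1}(-M) for every k = 1, ..., n.
identity_holds = all(
    kth_largest(spectrum, k) == -kth_largest(negated, n - k + 1)
    for k in range(1, n + 1)
)
```

Consequently, an upper tail bound for $\lambda_{n-k+1}$ of the negated sum translates into a lower tail bound for $\lambda_k$ of the original sum.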

To prove Theorem 5.1, we use the following lemma (Lemma 6.7 in [Tro11c]) to control the moment generating function of a random matrix with bounded maximum eigenvalue.

Lemma 5.2. Let $X$ be a random self-adjoint matrix satisfying $\mathbb{E} X = 0$ and $\lambda_{\max}(X) \le 1$ almost surely. Then

\[ \mathbb{E}\, e^{\theta X} \preceq \exp\big( (e^{\theta} - \theta - 1) \cdot \mathbb{E}(X^2) \big) \quad \text{for } \theta > 0. \]


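Proof of Theorem 5.1. Using homogeneity, we assume without loss that $\Psi(V_+) = 1$. This implies that $\lambda_{\max}(X_j) \le 1$ almost surely for all the summands. By Lemma 5.2,

\[ \mathbb{E}\, e^{\theta X_j} \preceq \exp\big( g(\theta) \cdot \mathbb{E}(X_j^2) \big), \]

with $g(\theta) = e^{\theta} - \theta - 1$.

Theorem 3.3(i) then implies

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \ge t \Big\} \le \inf_{\theta>0} e^{-\theta t} \cdot \operatorname{tr}\exp\Big\{ g(\theta) \sum_j V_+^* \mathbb{E}(X_j^2) V_+ \Big\} \]

\[ \le (n-k+1) \cdot \inf_{\theta>0} e^{-\theta t} \cdot \lambda_{\max}\Big( \exp\Big\{ g(\theta) \sum_j V_+^* \mathbb{E}(X_j^2) V_+ \Big\} \Big) \]

\[ = (n-k+1) \cdot \inf_{\theta>0} e^{-\theta t} \cdot \exp\Big\{ g(\theta) \cdot \lambda_{\max}\Big( \sum_j V_+^* \mathbb{E}(X_j^2) V_+ \Big) \Big\}. \]

The maximum eigenvalue in this expression equals $\sigma_k^2$, thus

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \ge t \Big\} \le (n-k+1) \cdot \inf_{\theta>0} e^{-\theta t + g(\theta)\sigma_k^2}. \]

The Bennett inequality (i) follows by substituting $\theta = \log(1 + t/\sigma_k^2)$ into the right-hand side and simplifying.

The Bernstein inequality (ii) is a consequence of (i) and the fact that

\[ h(u) \ge \frac{u^2/2}{1 + u/3} \quad \text{for } u \ge 0, \]

which can be established by comparing derivatives.

The subgaussian and subexponential portions of the split Bernstein inequalities (iii) are verified through algebraic comparisons on the relevant intervals. □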
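The chain (i) ≤ (ii) ≤ (iii) in Theorem 5.1 can be spot-checked numerically. The sketch below evaluates the three bounds for given values of $n$, $k$, $\sigma_k^2$, $\Psi(V_+)$, and $t$; the function name and the sample parameter values are assumptions chosen for illustration.

```python
import math

def h(u):
    """Bennett function h(u) = (1 + u) log(1 + u) - u, for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_chain(n, k, sigma2, psi, t):
    """Return the bounds (i), (ii), (iii) of Theorem 5.1 as a tuple."""
    dim = n - k + 1
    bennett = dim * math.exp(-(sigma2 / psi**2) * h(psi * t / sigma2))
    bernstein = dim * math.exp(-(t**2 / 2.0) / (sigma2 + psi * t / 3.0))
    if t <= sigma2 / psi:
        split = dim * math.exp(-3.0 * t**2 / (8.0 * sigma2))   # subgaussian branch
    else:
        split = dim * math.exp(-3.0 * t / (8.0 * psi))         # subexponential branch
    return bennett, bernstein, split

# Hypothetical parameters: n = 20, k = 5, sigma_k^2 = 2, Psi(V_+) = 1.
table = {t: bennett_chain(20, 5, 2.0, 1.0, t) for t in (0.1, 1.0, 2.0, 10.0)}
```

At the crossover $t = \sigma_k^2/\Psi(V_+)$ the Bernstein bound (ii) and both branches of (iii) coincide, which is why the split in (iii) is placed there.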

Occasionally,  as  in  the  application  in  section  7  to  the  problem  of  covariance  matrix  estimation, 
one  desires  a  Bernstein-type  tail  bound  that  applies  to  summands  that  do  not  have  bounded 
maximum  eigenvalues.  In  this  case,  if  the  moments  of  the  summands  satisfy  sufficiently  strong 
growth  restrictions,  one  can  extend  classical  scalar  arguments  to  obtain  results  such  as  the  following 
Bernstein  bound  for  subexponential  matrices. 

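Theorem 5.3 (Eigenvalue Bernstein Inequality for Subexponential Matrices). Consider a finite sequence $\{X_j\}$ of independent, random, self-adjoint matrices with dimension $n$, all of which satisfy the subexponential moment growth condition

\[ \mathbb{E}(X_j^m) \preceq \frac{m!}{2} B^{m-2} \Sigma_j^2 \quad \text{for } m = 2, 3, 4, \ldots, \]

where $B$ is a positive constant and $\Sigma_j^2$ are positive-semidefinite matrices. Given an integer $k \le n$, set

\[ \mu_k = \lambda_k\Big( \sum_j \mathbb{E} X_j \Big). \]

Choose $V_+ \in \mathbb{V}^n_{n-k+1}$ that satisfies

\[ \mu_k = \lambda_{\max}\Big( \sum_j V_+^*(\mathbb{E} X_j)V_+ \Big), \]

and define

\[ \sigma_k^2 = \lambda_{\max}\Big( \sum_j V_+^* \Sigma_j^2 V_+ \Big). \]

Then, for any $t \ge 0$,

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \ge \mu_k + t \Big\} \le (n-k+1) \cdot \exp\Big\{ -\frac{t^2/2}{\sigma_k^2 + Bt} \Big\} \qquad \text{(i)} \]

\[ \le \begin{cases} (n-k+1) \cdot \exp\{ -\tfrac{1}{4} t^2/\sigma_k^2 \} & \text{for } t \le \sigma_k^2/B \\ (n-k+1) \cdot \exp\{ -\tfrac{1}{4} t/B \} & \text{for } t \ge \sigma_k^2/B. \end{cases} \qquad \text{(ii)} \]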


This result is an extension of [Tro11c, Theorem 6.2], which, in turn, generalizes a classical scalar argument [DG98].

As with the other matrix inequalities, Theorem 5.3 follows from an application of Theorem 3.3 and appropriate semidefinite bounds on the moment generating functions of the summands. Thus, the key to the proof lies in exploiting the moment growth conditions of the summands to majorize their moment generating functions. The following lemma, a trivial extension of Lemma 6.8 in [Tro11c], provides what we need.


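Lemma 5.4. Let $X$ be a random self-adjoint matrix satisfying the subexponential moment growth conditions

\[ \mathbb{E}(X^m) \preceq \frac{m!}{2} \Sigma^2 \quad \text{for } m = 2, 3, 4, \ldots. \]

Then, for any $\theta$ in $[0, 1)$,

\[ \mathbb{E} \exp(\theta X) \preceq \exp\Big( \theta\, \mathbb{E} X + \frac{\theta^2}{2(1-\theta)} \Sigma^2 \Big). \]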


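Proof of Theorem 5.3. We note that $X_j$ satisfies the growth condition

\[ \mathbb{E}(X_j^m) \preceq \frac{m!}{2} B^{m-2} \Sigma_j^2 \quad \text{for } m \ge 2 \]

if and only if the scaled matrix $X_j/B$ satisfies

\[ \mathbb{E}\Big( \frac{X_j}{B} \Big)^m \preceq \frac{m!}{2} \cdot \frac{\Sigma_j^2}{B^2} \quad \text{for } m \ge 2. \]

Thus, by rescaling, it suffices to consider the case $B = 1$. We now do so.

By Lemma 5.4, the moment generating functions of the summands satisfy

\[ \mathbb{E} \exp(\theta X_j) \preceq \exp\big( \theta\, \mathbb{E} X_j + g(\theta) \Sigma_j^2 \big), \]

where $g(\theta) = \theta^2/(2 - 2\theta)$. Now we apply Theorem 3.3(i):

\[ \mathbb{P}\Big\{ \lambda_k\Big(\sum_j X_j\Big) \ge \mu_k + t \Big\} \le \inf_{\theta \in [0,1)} e^{-\theta(\mu_k + t)} \cdot \operatorname{tr}\exp\Big\{ \theta \sum_j V_+^*(\mathbb{E} X_j)V_+ + g(\theta) \sum_j V_+^* \Sigma_j^2 V_+ \Big\} \]

\[ \le \inf_{\theta \in [0,1)} (n-k+1) \cdot \exp\Big\{ -\theta(\mu_k + t) + \theta \cdot \lambda_{\max}\Big( \sum_j V_+^*(\mathbb{E} X_j)V_+ \Big) + g(\theta) \cdot \lambda_{\max}\Big( \sum_j V_+^* \Sigma_j^2 V_+ \Big) \Big\} \]

\[ = \inf_{\theta \in [0,1)} (n-k+1) \cdot \exp\{ -\theta t + g(\theta)\sigma_k^2 \}. \]

To achieve the final simplification, we identified $\mu_k$ and $\sigma_k^2$. Now, select $\theta = t/(t + \sigma_k^2)$. Then simplification gives the Bernstein inequality (i).

Algebraic comparisons on the relevant intervals yield the split Bernstein inequalities (ii). □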
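The last step of the proof can be verified with a short computation. The sketch below assumes $B = 1$ when checking the choice $\theta = t/(t + \sigma_k^2)$, as in the proof; the helper names and sample values are illustrative. Note that this choice of $\theta$ is a convenient one and need not be the exact minimizer of the exponent.

```python
import math

def g(theta):
    """g(theta) = theta^2 / (2 - 2 theta), for theta in [0, 1)."""
    return theta**2 / (2.0 - 2.0 * theta)

def exponent(theta, t, sigma2):
    """Exponent -theta*t + g(theta)*sigma2 from the proof (case B = 1)."""
    return -theta * t + g(theta) * sigma2

def subexp_bernstein(n, k, sigma2, B, t):
    """Return the bounds (i) and (ii) of Theorem 5.3 as a tuple."""
    dim = n - k + 1
    main = dim * math.exp(-(t**2 / 2.0) / (sigma2 + B * t))
    if t <= sigma2 / B:
        split = dim * math.exp(-t**2 / (4.0 * sigma2))   # subgaussian branch
    else:
        split = dim * math.exp(-t / (4.0 * B))           # subexponential branch
    return main, split

t, sigma2 = 1.5, 2.0
theta = t / (t + sigma2)                  # the choice made in the proof
value = exponent(theta, t, sigma2)        # should equal -(t^2/2) / (sigma2 + t)
```

The split bounds in (ii) follow because $\sigma_k^2 + Bt \le 2\sigma_k^2$ when $t \le \sigma_k^2/B$ and $\sigma_k^2 + Bt \le 2Bt$ otherwise.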


6. An application to column subsampling

As  an  application  of  our  Chernoff  bounds,  we  examine  how  sampling  columns  from  a  matrix 
with  orthonormal  rows  affects  the  spectrum.  This  question  has  applications  in  numerical  linear 
algebra  and  compressed  sensing.  The  special  cases  of  the  maximum  and  minimum  eigenvalues 
have  been  studied  in  the  literature  [Tro08,  RV07].  The  limiting  spectral  distributions  of  matrices 
formed  by  sampling  columns  from  similarly  structured  matrices  have  also  been  studied:  the  results 
of  [GH10]  apply  to  matrices  formed  by  sampling  columns  from  any  fixed  orthogonal  matrix,  and 
[Far10] studies matrices formed by sampling columns and rows from the discrete Fourier transform
matrix. We mention in particular [Rud99], the main result of which provides a uniform bound on
the tails of all singular values of the sampled matrix. The theorem proven in this section provides
bounds which reflect the differences in the tails of the individual singular values, and thus can be
viewed as an elaboration of the result in [Rud99].

Let $U$ be an $n \times r$ matrix with orthonormal rows. We model the sampling operation using a random diagonal matrix $D$ whose entries are independent Bern($p$) random variables. Then the random matrix

\[ \hat{U} = UD \qquad (6.1) \]

can be interpreted as a random column submatrix of $U$ with an average of $pr$ nonzero columns. Our goal is to study the behavior of the spectrum of $\hat{U}$.
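The sampling model (6.1) can be sketched in a few lines. The matrix below is a hypothetical $2 \times 4$ example with orthonormal rows, and the helper name `sample_columns` is an assumption made for illustration.

```python
import random

def sample_columns(U, p, rng):
    """Return (U_hat, d) with U_hat = U D, D = diag(d), and d_j ~ Bern(p) independent."""
    r = len(U[0])
    d = [1 if rng.random() < p else 0 for _ in range(r)]
    # Multiplying on the right by D zeroes out the columns that were not selected.
    U_hat = [[row[j] * d[j] for j in range(r)] for row in U]
    return U_hat, d

# Hypothetical 2 x 4 matrix with orthonormal rows.
U = [[0.5, 0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5, -0.5]]
rng = random.Random(0)
U_hat, d = sample_columns(U, p=0.5, rng=rng)
```

Because each diagonal entry is 0 or 1, $D^2 = D$, and the number of retained columns has mean $pr$ (here $0.5 \cdot 4 = 2$).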

Recall that the $j$th column of $U$ is written $u_j$. Consider the following coherence-like quantity associated with $U$:

\[ \tau_k = \min_{V \in \mathbb{V}^n_k} \max_j \|V^* u_j\|^2 \quad \text{for } k = 1, \ldots, n. \qquad (6.2) \]

There does not seem to be a simple expression for $\tau_k$. However, by choosing $V^*$ to be the restriction to an appropriate $k$-dimensional coordinate subspace, we see that $\tau_k$ always satisfies

\[ \tau_k \le \min_{|I| = k} \max_j \sum_{i \in I} |u_{ij}|^2. \]
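For small matrices, the coordinate-subspace bound on $\tau_k$ can be computed by brute force over index sets. This is a sketch under the assumption that the index sets have exactly $k$ elements; the example matrix and the function name are hypothetical.

```python
from itertools import combinations

def coordinate_bound(U, k):
    """min over index sets I with |I| = k of max_j sum_{i in I} |U[i][j]|^2."""
    n, r = len(U), len(U[0])
    best = float("inf")
    for I in combinations(range(n), k):
        # Largest squared norm of a column restricted to the coordinates in I.
        worst = max(sum(abs(U[i][j]) ** 2 for i in I) for j in range(r))
        best = min(best, worst)
    return best

# Hypothetical 3 x 4 matrix with orthonormal rows.
U = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.6, 0.8, 0.0],
     [0.0, 0.8, -0.6, 0.0]]
bounds = [coordinate_bound(U, k) for k in (1, 2, 3)]
```

The search visits $\binom{n}{k}$ index sets, so it is practical only for small $n$; it gives an upper bound on $\tau_k$, not its exact value.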

The following theorem shows that the behavior of $s_k(\hat{U})$, the $k$th singular value of $\hat{U}$, can be explained in terms of $\tau_k$.

Theorem 6.1 (Column Subsampling of Matrices with Orthonormal Rows). Let $U$ be an $n \times r$ matrix with orthonormal rows, and let $p$ be a sampling probability. Define the sampled matrix $\hat{U}$ according to (6.1), and the numbers $\{\tau_k\}$ according to (6.2). Then, for each $k = 1, \ldots, n$,

\[ \mathbb{P}\big\{ s_k(\hat{U}) \ge \sqrt{(1+\delta)p} \big\} \le (n-k+1) \cdot \Big[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \Big]^{p/\tau_{n-k+1}} \quad \text{for } \delta > 0, \]

\[ \mathbb{P}\big\{ s_k(\hat{U}) \le \sqrt{(1-\delta)p} \big\} \le k \cdot \Big[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \Big]^{p/\tau_k} \quad \text{for } \delta \in [0, 1). \]


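Proof. Observe, using (6.1), that

\[ s_k(\hat{U})^2 = \lambda_k(U D^2 U^*) = \lambda_k\Big( \sum_j d_j u_j u_j^* \Big), \]

where $u_j$ is the $j$th column of $U$ and $d_j \sim$ Bern($p$). Compute

\[ \mu_k = \lambda_k\Big( \sum_j \mathbb{E}\, d_j u_j u_j^* \Big) = p \cdot \lambda_k(U U^*) = p \cdot \lambda_k(I) = p. \]

It follows that, for any $V \in \mathbb{V}^n_{n-k+1}$,

\[ \lambda_{\max}\Big( \sum_j V^*(\mathbb{E}\, d_j u_j u_j^*)V \Big) = p \cdot \lambda_{\max}(V^* V) = p = \mu_k, \]

so the choice of $V_+ \in \mathbb{V}^n_{n-k+1}$ is arbitrary. Similarly, the choice of $V_- \in \mathbb{V}^n_k$ is arbitrary. We select $V_+$ to be an isometric embedding that achieves $\tau_{n-k+1}$ and $V_-$ to be an isometric embedding that achieves $\tau_k$. Accordingly,

\[ \Psi(V_+) = \max_j \|V_+^* u_j u_j^* V_+\| = \max_j \|V_+^* u_j\|^2 = \tau_{n-k+1}, \quad\text{and} \]

\[ \Psi(V_-) = \max_j \|V_-^* u_j u_j^* V_-\| = \max_j \|V_-^* u_j\|^2 = \tau_k. \]

Theorem 4.1 delivers the upper bound

\[ \mathbb{P}\big\{ s_k(\hat{U}) \ge \sqrt{(1+\delta)p} \big\} = \mathbb{P}\Big\{ \lambda_k\Big( \sum_j d_j u_j u_j^* \Big) \ge (1+\delta)p \Big\} \le (n-k+1) \cdot \Big[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \Big]^{p/\tau_{n-k+1}} \]

for $\delta > 0$ and the lower bound

\[ \mathbb{P}\big\{ s_k(\hat{U}) \le \sqrt{(1-\delta)p} \big\} = \mathbb{P}\Big\{ \lambda_k\Big( \sum_j d_j u_j u_j^* \Big) \le (1-\delta)p \Big\} \le k \cdot \Big[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \Big]^{p/\tau_k} \]

for $\delta \in [0, 1)$. □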

To illustrate the discriminatory power of these bounds, let $U$ be an $n \times n^2$ matrix consisting of $n$ rows of the $n^2 \times n^2$ Fourier matrix and choose $p = (\log n)/n$ so that, on average, sampling reduces the aspect ratio from $n$ to $\log n$. For $n = 100$, we determine upper and lower bounds for the median value of $s_k(\hat{U})$ by numerically finding the value of $\delta$ where the probability bounds in Theorem 6.1 equal 1/2. Figure 1 plots the empirical median value along with the computed interval. We see that these ranges reflect the behavior of the singular values more faithfully than the simple estimates $s_k(\mathbb{E}\hat{U}) = p$.
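The numerical procedure described above (finding the $\delta$ at which each bound of Theorem 6.1 equals 1/2) can be sketched with bisection. Two assumptions are made here and are not taken from the paper: the value of $\tau_k$ is replaced by the coordinate-subspace bound $k/n^2$, which holds for rows of the unitary DFT matrix because every entry has squared modulus $1/n^2$, and the lower end of the interval is set to 0 whenever the lower bound never drops to 1/2. The function names are illustrative.

```python
import math

def upper_tail(delta, n, k, p, tau):
    """Upper bound of Theorem 6.1: (n-k+1) * [e^d / (1+d)^(1+d)]^(p/tau)."""
    return (n - k + 1) * math.exp((delta - (1.0 + delta) * math.log1p(delta)) * p / tau)

def lower_tail(delta, k, p, tau):
    """Lower bound of Theorem 6.1: k * [e^(-d) / (1-d)^(1-d)]^(p/tau), 0 <= d < 1."""
    return k * math.exp((-delta - (1.0 - delta) * math.log1p(-delta)) * p / tau)

def solve_decreasing(f, lo, hi, target=0.5, iters=200):
    """Bisection for f(x) = target when f is decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def median_interval(n, k, p, tau_k, tau_nk1):
    """Interval [s_lo, s_hi] containing the median of s_k, from the two tail bounds."""
    hi = 1.0
    while upper_tail(hi, n, k, p, tau_nk1) > 0.5:     # bracket the root
        hi *= 2.0
    d_plus = solve_decreasing(lambda d: upper_tail(d, n, k, p, tau_nk1), 0.0, hi)
    d_max = 1.0 - 1e-12
    if lower_tail(d_max, k, p, tau_k) > 0.5:          # bound never reaches 1/2
        s_lo = 0.0
    else:
        d_minus = solve_decreasing(lambda d: lower_tail(d, k, p, tau_k), 0.0, d_max)
        s_lo = math.sqrt((1.0 - d_minus) * p)
    return s_lo, math.sqrt((1.0 + d_plus) * p)

n = 100
p = math.log(n) / n
tau = lambda k: k / n**2      # assumed coordinate-subspace bound for DFT rows
intervals = {k: median_interval(n, k, p, tau(k), tau(n - k + 1)) for k in (1, 10, 50, 100)}
```

Using an upper bound in place of $\tau_k$ keeps the tail bounds valid but may widen the resulting intervals relative to those in Figure 1.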


7.  Covariance  Estimation 

We  conclude  with  an  extended  example  that  illustrates  how  this  circle  of  ideas  allows  one  to 
answer  interesting  statistical  questions.  Specifically,  we  investigate  the  convergence  of  the  individual 
eigenvalues  of  sample  covariance  matrices,  with  errors  measured  in  relative  precision. 

Covariance estimation is a basic and ubiquitous problem that arises in signal processing, graphical modeling, machine learning, and genomics, among other areas. Let $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^p$ be i.i.d. samples drawn from some distribution with zero mean and covariance matrix $C$. Define the sample covariance matrix

\[ \hat{C}_n = \frac{1}{n} \sum_j \eta_j \eta_j^*. \]

Figure 1. [Spectrum of a random submatrix] The matrix $U$ is a $10^2 \times 10^4$ submatrix of the unitary DFT matrix with dimension $10^4$, and the sampling probability $p = 10^{-4} \log(10^4)$. The $k$th vertical bar, calculated using Theorem 6.1, describes an interval containing the median value of the $k$th singular value of the sampled matrix $\hat{U}$. The black circles denote the empirical medians of the singular values of $\hat{U}$, calculated from 500 trials. The gray circles represent the singular values of $\mathbb{E}\hat{U}$.

An important challenge is to determine how many samples are needed to ensure that the empirical covariance estimator has a fixed relative accuracy in the spectral norm. That is, given a fixed $\varepsilon$, how large must $n$ be so that

\[ \|\hat{C}_n - C\| \le \varepsilon \|C\| ? \qquad (7.1) \]

This estimation problem has been studied extensively. It is now known that for distributions with a finite second moment, $\Omega(p \log p)$ samples suffice [Rud99], and for log-concave distributions, $\Omega(p)$ samples suffice [ALPTJ11]. More broadly, Vershynin [Ver11] conjectures that, for distributions with finite fourth moment, $\Omega(p)$ samples suffice; he establishes this result to within iterated log factors. In [SV11], Srivastava and Vershynin establish that $\Omega(p)$ samples suffice for distributions which have finite $2 + \varepsilon$ moments, for some $\varepsilon > 0$, and satisfy an additional regularity condition.

Inequality (7.1) ensures that the difference between the $k$th eigenvalues of $\hat{C}_n$ and $C$ is small, but it requires $O(p)$ measurements to obtain estimates of even a few of the eigenvalues. Specifically, letting $\kappa_\ell = \lambda_1(C)/\lambda_\ell(C)$, we see that $O(\varepsilon^{-2} \kappa_\ell^2 p)$ measurements are required to obtain relative-error estimates of the dominant $\ell$ eigenvalues of $C$ using the results of [ALPTJ11, Ver11, SV11]. However, it is reasonable to expect that when the spectrum of $C$ exhibits decay and $\ell \ll p$, far fewer than $O(p)$ measurements should suffice for relative-error recovery of the dominant $\ell$ eigenvalues.

In this section, we derive a relative approximation bound for each eigenvalue of $C$ that allows us to confirm this intuition. For simplicity we assume the samples are drawn from a $\mathcal{N}(0, C)$ distribution where $C$ is full-rank, but the arguments can be extended to cover other distributions.

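Theorem 7.1. Assume that $C \in \mathbb{M}^p_{\mathrm{sa}}$ is positive definite. Let $\{\eta_j\}_{j=1}^n \subset \mathbb{R}^p$ be i.i.d. samples drawn from a $\mathcal{N}(0, C)$ distribution. Define

\[ \hat{C}_n = \frac{1}{n} \sum_j \eta_j \eta_j^*. \]

Write $\lambda_k$ for the $k$th eigenvalue of $C$, and write $\hat\lambda_k$ for the $k$th eigenvalue of $\hat{C}_n$. Then for $k = 1, \ldots, p$,

\[ \mathbb{P}\big\{ \hat\lambda_k \ge \lambda_k + t \big\} \le (p-k+1) \cdot \exp\Big( \frac{-c\, n t^2}{\lambda_k \sum_{i=k}^p \lambda_i} \Big) \quad \text{for } t \le 4n\lambda_k, \]

and

\[ \mathbb{P}\big\{ \hat\lambda_k \le \lambda_k - t \big\} \le k \cdot \exp\Big( \frac{-c\, n t^2}{\lambda_1 \sum_{i=1}^k \lambda_i} \Big) \quad \text{for } t \le 4n\lambda_1, \]

where the constant $c$ is at least 1/32.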

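The following corollary provides an answer to our question about relative error estimates.

Corollary 7.2. Let $\lambda_k$ and $\hat\lambda_k$ be as in Theorem 7.1. Then

\[ \mathbb{P}\big\{ \hat\lambda_k \ge (1+\varepsilon)\lambda_k \big\} \le (p-k+1) \cdot \exp\Big( \frac{-c\, n \varepsilon^2}{\sum_{i=k}^p \lambda_i/\lambda_k} \Big) \quad \text{for } \varepsilon \le 4n, \]

and

\[ \mathbb{P}\big\{ \hat\lambda_k \le (1-\varepsilon)\lambda_k \big\} \le k \cdot \exp\Big( \frac{-c\, n \varepsilon^2}{\frac{\lambda_1}{\lambda_k} \sum_{i=1}^k \frac{\lambda_i}{\lambda_k}} \Big) \quad \text{for } \varepsilon \in (0, 1], \]

where the constant $c$ is at least 1/32.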


The first bound in Corollary 7.2 tells us how many samples are needed to ensure that $\hat\lambda_k$ does not overestimate $\lambda_k$. Likewise, the second bound tells us how many samples ensure that $\hat\lambda_k$ does not underestimate $\lambda_k$.

Corollary 7.2 suggests that the relationship of $\hat\lambda_k$ to $\lambda_k$ is determined by the spectrum of $C$ in the following manner. When the eigenvalues below $\lambda_k$ are small compared with $\lambda_k$, the quantity

\[ \sum_{i=k}^p \frac{\lambda_i}{\lambda_k} \]

is small, and so $\hat\lambda_k$ is not likely to overestimate $\lambda_k$. Similarly, when the eigenvalues above $\lambda_k$ are comparable with $\lambda_k$, the quantity

\[ \frac{\lambda_1}{\lambda_k} \sum_{i=1}^k \frac{\lambda_i}{\lambda_k} \]

is small, and so $\hat\lambda_k$ is not likely to underestimate $\lambda_k$.
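To make the role of the two spectral quantities concrete, the bounds of Corollary 7.2 can be evaluated for a given spectrum. The sketch below assumes the eigenvalues are supplied in decreasing order and uses the value $c = 1/32$; the geometric spectrum and the function names are hypothetical.

```python
import math

def overestimate_bound(eigs, k, n, eps, c=1.0 / 32.0):
    """(p-k+1) * exp(-c n eps^2 / sum_{i>=k} lambda_i/lambda_k); k is 1-based."""
    p = len(eigs)
    tail_ratio = sum(eigs[k - 1:]) / eigs[k - 1]
    return (p - k + 1) * math.exp(-c * n * eps**2 / tail_ratio)

def underestimate_bound(eigs, k, n, eps, c=1.0 / 32.0):
    """k * exp(-c n eps^2 / ((lambda_1/lambda_k) * sum_{i<=k} lambda_i/lambda_k))."""
    head_ratio = sum(eigs[:k]) / eigs[k - 1]
    kappa_k = eigs[0] / eigs[k - 1]
    return k * math.exp(-c * n * eps**2 / (kappa_k * head_ratio))

# Hypothetical spectrum with geometric decay: lambda_i = 2^-(i-1), p = 50.
eigs = [2.0 ** (-i) for i in range(50)]
probabilities = {
    k: (overestimate_bound(eigs, k, 20000, 0.5), underestimate_bound(eigs, k, 20000, 0.5))
    for k in (1, 3, 5)
}
```

With this decaying spectrum the tail ratio stays below 2 for every $k$, while the factor $\kappa_k$ in the underestimate bound grows with $k$, so the second bound weakens faster as $k$ increases.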

We  now  have  everything  needed  to  establish  Theorem  1.1. 


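Proof of Theorem 1.1 from Corollary 7.2. From Corollary 7.2, we see that

\[ \mathbb{P}\big\{ \hat\lambda_k \le (1-\varepsilon)\lambda_k \big\} \le p^{-\beta} \quad \text{when } n \ge 32 \varepsilon^{-2}\, \frac{\lambda_1}{\lambda_k} \Big( \sum_{i \le k} \frac{\lambda_i}{\lambda_k} \Big) (\log k + \beta \log p). \]

Recall that $\kappa_k = \lambda_1(C)/\lambda_k(C)$, so the sum in this condition is at most $k\kappa_k$. Clearly, taking $n = \Omega(\varepsilon^{-2} \kappa_\ell^2\, \ell \log p)$ samples ensures that, with high probability, each of the top $\ell$ eigenvalues of the sample covariance matrix satisfies $\hat\lambda_k \ge (1-\varepsilon)\lambda_k$. Likewise,

\[ \mathbb{P}\big\{ \hat\lambda_k \ge (1+\varepsilon)\lambda_k \big\} \le p^{-\beta} \quad \text{when } n \ge 32 \varepsilon^{-2} \Big( \sum_{i \ge k} \frac{\lambda_i}{\lambda_k} \Big) (\log(p-k+1) + \beta \log p). \]

Assuming the stated decay condition, that

\[ \sum_{i > \ell} \lambda_i = O(\lambda_1), \]

we see that taking $n = \Omega(\varepsilon^{-2}(\ell + \kappa_\ell) \log p)$ samples ensures that, with high probability, each of the top $\ell$ eigenvalues of the sample covariance matrix satisfies $\hat\lambda_k \le (1+\varepsilon)\lambda_k$.

Combining these two results, we conclude that $n = \Omega(\varepsilon^{-2} \kappa_\ell^2\, \ell \log p)$ ensures that the top $\ell$ eigenvalues of $C$ are estimated to within relative precision $1 \pm \varepsilon$. □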
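The sample-size conditions used in this proof come from solving each bound of Corollary 7.2 for $n$ with failure probability $p^{-\beta}$. The following sketch computes those thresholds, assuming $c = 1/32$ and eigenvalues listed in decreasing order; the function names and the example spectrum are hypothetical.

```python
import math

def n_no_underestimate(eigs, k, eps, beta):
    """Smallest integer n with k * exp(-n eps^2 / (32 F)) <= p^(-beta),
    where F = (lambda_1/lambda_k) * sum_{i<=k} lambda_i/lambda_k."""
    p = len(eigs)
    lam_k = eigs[k - 1]
    factor = (eigs[0] / lam_k) * (sum(eigs[:k]) / lam_k)
    return math.ceil(32.0 * factor * (math.log(k) + beta * math.log(p)) / eps**2)

def n_no_overestimate(eigs, k, eps, beta):
    """Smallest integer n with (p-k+1) * exp(-n eps^2 / (32 F)) <= p^(-beta),
    where F = sum_{i>=k} lambda_i/lambda_k."""
    p = len(eigs)
    factor = sum(eigs[k - 1:]) / eigs[k - 1]
    return math.ceil(32.0 * factor * (math.log(p - k + 1) + beta * math.log(p)) / eps**2)

# Hypothetical spectrum with geometric decay: lambda_i = 2^-(i-1), p = 50.
eigs = [2.0 ** (-i) for i in range(50)]
k, eps, beta = 3, 0.1, 1.0
n_lower = n_no_underestimate(eigs, k, eps, beta)
n_upper = n_no_overestimate(eigs, k, eps, beta)
```

For a decaying spectrum such as this one, the threshold guarding against underestimation dominates, which matches the conclusion that the $\kappa_\ell^2\, \ell$ term governs the overall sample complexity.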

Remark 7.1. The results in Theorem 7.1 and Corollary 7.2 also apply when $C$ is rank-deficient: simply replace each occurrence of the dimension $p$ in the bounds with rank($C$).

7.1.  Proof  of  Theorem  7.1.  We  now  prove  Theorem  7.1.  This  result  requires  supporting  lemmas; 
we  defer  their  proofs  until  after  a  discussion  of  extensions  to  Theorem  7.1. 

We study the error $|\lambda_k(\hat{C}_n) - \lambda_k(C)|$. To apply the methods developed in this paper, we pass to a question about the eigenvalues of a difference of two matrices. The first lemma accomplishes this goal by compressing both the population covariance matrix and the sample covariance matrix to a fixed invariant subspace of the population covariance matrix.

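Lemma 7.3. Let $X$ be a random self-adjoint matrix with dimension $p$, and let $A$ be a fixed self-adjoint matrix with dimension $p$. Choose $W_+ \in \mathbb{V}^p_{p-k+1}$ and $W_- \in \mathbb{V}^p_k$ for which

\[ \lambda_k(A) = \lambda_{\max}(W_+^* A W_+) = \lambda_{\min}(W_-^* A W_-). \]

Then, for all $t > 0$,

\[ \mathbb{P}\{ \lambda_k(X) \ge \lambda_k(A) + t \} \le \mathbb{P}\{ \lambda_{\max}(W_+^* X W_+) \ge \lambda_k(A) + t \} \qquad (7.2) \]

and

\[ \mathbb{P}\{ \lambda_k(X) \le \lambda_k(A) - t \} \le \mathbb{P}\{ \lambda_{\max}(W_-^*(-X)W_-) \ge -\lambda_k(A) + t \}. \qquad (7.3) \]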

We apply this result with $A = C$ and $X = \hat{C}_n$. Because $\hat{C}_n$ is unbounded, we apply Theorem 5.3 to handle the estimates in (7.2) and (7.3). To use this theorem, we need the following moment growth estimate for rank-one Wishart matrices.

Lemma 7.4. Let $\xi \sim \mathcal{N}(0, G)$. Then for any integer $m \ge 2$,

\[ \mathbb{E}(\xi\xi^*)^m \preceq 2^m m!\, (\operatorname{tr} G)^{m-1} \cdot G. \]
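In the scalar case $p = 1$, the lemma reduces to a statement about even Gaussian moments: if $\xi \sim \mathcal{N}(0, G)$ with scalar variance $G$, then $\mathbb{E}\,\xi^{2m} = (2m-1)!!\, G^m$, and the claimed bound reads $(2m-1)!! \le 2^m m!$. The sketch below checks this special case only, using the standard Gamma-function formula for the moments of a standard Gaussian; the helper names are illustrative.

```python
import math

def double_factorial_odd(m):
    """(2m - 1)!! = 1 * 3 * 5 * ... * (2m - 1)."""
    out = 1
    for j in range(1, 2 * m, 2):
        out *= j
    return out

def gaussian_even_moment(m):
    """E g^(2m) for g ~ N(0, 1), computed as 2^m * Gamma(m + 1/2) / sqrt(pi)."""
    return 2.0**m * math.gamma(m + 0.5) / math.sqrt(math.pi)

checks = []
for m in range(1, 11):
    moment = gaussian_even_moment(m)
    # (exact moment, bound from Lemma 7.4 with G = 1, sharper bound 2^m m!/sqrt(pi))
    checks.append((moment, 2**m * math.factorial(m),
                   2.0**m * math.factorial(m) / math.sqrt(math.pi)))
```

The third entry of each tuple is the intermediate estimate that arises when $\Gamma(m + 1/2)$ is replaced by $\Gamma(m + 1) = m!$; it is smaller than the constant $2^m m!$ stated in the lemma.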


With  these  preliminaries  addressed,  we  prove  Theorem  7.1. 


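Proof of upper estimate. First we consider the probability that $\hat\lambda_k$ overestimates $\lambda_k$. Let $W_+ \in \mathbb{V}^p_{p-k+1}$ satisfy

\[ \lambda_k(C) = \lambda_{\max}(W_+^* C W_+). \]

Then Lemma 7.3 implies

\[ \mathbb{P}\{ \lambda_k(\hat{C}_n) \ge \lambda_k(C) + t \} \le \mathbb{P}\{ \lambda_{\max}(W_+^* \hat{C}_n W_+) \ge \lambda_k(C) + t \} \]

\[ = \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_j W_+^*(\eta_j \eta_j^*)W_+ \Big) \ge n\lambda_k(C) + nt \Big\}. \qquad (7.4) \]

The factor $n$ comes from the normalization of the sample covariance matrix.

The covariance matrix of $\eta_j$ is $C$, so that of $W_+^* \eta_j$ is $W_+^* C W_+$. Apply Lemma 7.4 to verify that $W_+^* \eta_j \eta_j^* W_+$ satisfies the subexponential moment growth bound required by Theorem 5.3 with

\[ B = 2\operatorname{tr}(W_+^* C W_+) \quad\text{and}\quad \Sigma_j^2 = 8\operatorname{tr}(W_+^* C W_+) \cdot W_+^* C W_+. \]

In fact, $W_+^* C W_+$ is the compression of $C$ to the invariant subspace corresponding with its bottom $p-k+1$ eigenvalues, so

\[ B = 2\sum_{i=k}^p \lambda_i(C) \quad\text{and}\quad \lambda_{\max}(\Sigma_j^2) = 8\lambda_k(C) \sum_{i=k}^p \lambda_i(C). \]

We are concerned with the maximum eigenvalue of the sum in (7.4), so we take $V_+ = I$ in the statement of Theorem 5.3 to find that

\[ \sigma_1^2 = \lambda_{\max}\Big( \sum_j \Sigma_j^2 \Big) = n\lambda_{\max}(\Sigma_1^2) = 8n\lambda_k(C) \sum_{i=k}^p \lambda_i(C) \quad\text{and} \]

\[ \mu_1 = \lambda_{\max}\Big( \sum_j W_+^* \mathbb{E}(\eta_j \eta_j^*)W_+ \Big) = n\lambda_{\max}(W_+^* C W_+) = n\lambda_k(C). \]

It follows from the subgaussian branch of the split Bernstein inequality of Theorem 5.3 that

\[ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_j W_+^*(\eta_j \eta_j^*)W_+ \Big) \ge n\lambda_k(C) + nt \Big\} \le (p-k+1) \cdot \exp\Big( \frac{-nt^2}{32\lambda_k(C) \sum_{i=k}^p \lambda_i(C)} \Big) \]

when $t \le 4n\lambda_k(C)$. This provides the desired bound on the probability that $\lambda_k(\hat{C}_n)$ overestimates $\lambda_k(C)$. □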
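The constants used in the proof of the upper estimate can be traced through two short computations. The first matches the bound of Lemma 7.4, with $G = W_+^* C W_+$, to the growth condition of Theorem 5.3; the second evaluates the subgaussian exponent for a deviation of size $nt$. The shorthand $S = \sum_{i=k}^p \lambda_i(C)$ is introduced here only for readability.

```latex
% Lemma 7.4 rewritten in the form (m!/2) B^{m-2} \Sigma_j^2:
\[
2^m m!\,(\operatorname{tr} G)^{m-1}\, G
  = \frac{m!}{2}\,(2\operatorname{tr} G)^{m-2} \cdot 8(\operatorname{tr} G)\, G,
\qquad\text{so}\quad
B = 2\operatorname{tr} G, \quad \Sigma_j^2 = 8(\operatorname{tr} G)\, G .
\]
% Subgaussian exponent with \sigma_1^2 = 8 n \lambda_k(C) S and deviation nt:
\[
\frac{1}{4} \cdot \frac{(nt)^2}{\sigma_1^2}
  = \frac{n^2 t^2}{32\, n\, \lambda_k(C)\, S}
  = \frac{n t^2}{32\, \lambda_k(C) \sum_{i=k}^{p} \lambda_i(C)},
\qquad
\frac{\sigma_1^2}{B} = \frac{8 n \lambda_k(C)\, S}{2 S} = 4 n \lambda_k(C).
\]
```

The first identity holds because $\tfrac{1}{2} \cdot 2^{m-2} \cdot 8 = 2^m$ and the powers of $\operatorname{tr} G$ add up to $m-1$.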

Proof of lower estimate. Now we consider the probability that $\hat\lambda_k$ underestimates $\lambda_k$. The proof proceeds similarly to the proof of the upper estimate. Let $W_- \in \mathbb{V}^p_k$ satisfy

\[ \lambda_k(C) = \lambda_{\min}(W_-^* C W_-). \]

Then Lemma 7.3 implies

\[ \mathbb{P}\{ \lambda_k(\hat{C}_n) \le \lambda_k(C) - t \} \le \mathbb{P}\{ \lambda_{\max}(W_-^*(-\hat{C}_n)W_-) \ge -\lambda_k(C) + t \} \]

\[ = \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_j W_-^*(-\eta_j \eta_j^*)W_- \Big) \ge -n\lambda_k(C) + nt \Big\}. \qquad (7.5) \]

The factor $n$ comes from the normalization of the sample covariance matrix.

The covariance matrix of $\eta_j$ is $C$, so that of $W_-^* \eta_j$ is $W_-^* C W_-$. Apply Lemma 7.4 to verify that for any integer $m \ge 2$,

\[ \mathbb{E}(W_-^* \eta_j \eta_j^* W_-)^m \preceq 2^m m!\, \operatorname{tr}(W_-^* C W_-)^{m-1} \cdot W_-^* C W_-. \]

Thus, $W_-^*(-\eta_j \eta_j^*)W_-$ satisfies the subexponential moment growth bound required by Theorem 5.3 with

\[ B = 2\operatorname{tr}(W_-^* C W_-) \quad\text{and}\quad \Sigma_j^2 = 8\operatorname{tr}(W_-^* C W_-) \cdot W_-^* C W_-. \]

In fact, $W_-^* C W_-$ is the compression of $C$ to the invariant subspace corresponding with its top $k$ eigenvalues, so

\[ B = 2\sum_{i=1}^k \lambda_i(C) \quad\text{and}\quad \lambda_{\max}(\Sigma_j^2) = 8\lambda_1(C) \sum_{i=1}^k \lambda_i(C). \]

We are concerned with the maximum eigenvalue of the sum in (7.5), so we take $V_+ = I$ in the statement of Theorem 5.3 to find that

\[ \sigma_1^2 = \lambda_{\max}\Big( \sum_j \Sigma_j^2 \Big) = n\lambda_{\max}(\Sigma_1^2) = 8n\lambda_1(C) \sum_{i=1}^k \lambda_i(C) \quad\text{and} \]

\[ \mu_1 = \lambda_{\max}\Big( \sum_j W_-^* \mathbb{E}(-\eta_j \eta_j^*)W_- \Big) = n\lambda_{\max}(W_-^*(-C)W_-) = -n\lambda_k(C). \]

It follows from the subgaussian branch of the split Bernstein inequality of Theorem 5.3 that

\[ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_j W_-^*(-\eta_j \eta_j^*)W_- \Big) \ge -n\lambda_k(C) + nt \Big\} \le k \cdot \exp\Big( \frac{-nt^2}{32\lambda_1(C) \sum_{i=1}^k \lambda_i(C)} \Big) \]

when $t \le 4n\lambda_1(C)$. This provides the desired bound on the probability that $\lambda_k(\hat{C}_n)$ underestimates $\lambda_k(C)$. □

7.2. Extensions of Theorem 7.1. Results analogous to Theorem 7.1 can be established for other distributions. If the distribution is bounded, the probability that $\hat\lambda_k$ deviates above or below $\lambda_k$ can be controlled using the Bernstein inequality of Theorem 5.1. If the distribution is unbounded but has matrix moments that satisfy a sufficiently nice growth condition, the probability that $\hat\lambda_k$ deviates below $\lambda_k$, as well as the probability that it deviates above $\lambda_k$, can be bounded using a Bernstein inequality analogous to that in Theorem 5.3.

Theorem 7.1 controls the error in the $k$th sample eigenvalue in terms of all the eigenvalues of the covariance matrix, so it is most useful when the eigenvalues of the covariance matrix satisfy decay conditions such as those given in the statement of Theorem 1.1. If such conditions are not satisfied, the results of [ALPTJ11] on the convergence of empirical covariance matrices of isotropic log-concave random vectors lead to tighter bounds on the probabilities that $\hat\lambda_k$ overestimates or underestimates $\lambda_k$.

To see the relevance of the results in [ALPTJ11], first observe the following consequence of the subadditivity of the maximum eigenvalue mapping:

\[ \lambda_{\max}(W_+^*(X - A)W_+) \ge \lambda_{\max}(W_+^* X W_+) - \lambda_{\max}(W_+^* A W_+) = \lambda_{\max}(W_+^* X W_+) - \lambda_k(A). \]

In conjunction with (7.2), this gives us the following control on the probability that $\lambda_k(X)$ overestimates $\lambda_k(A)$:

\[ \mathbb{P}\{ \lambda_k(X) \ge \lambda_k(A) + t \} \le \mathbb{P}\{ \lambda_{\max}(W_+^*(X - A)W_+) \ge t \}. \]

In our application, $X$ is the empirical covariance matrix and $A$ is the actual covariance matrix. The spectral norm dominates the maximum eigenvalue, so

\[ \mathbb{P}\{ \lambda_k(\hat{C}_n) \ge \lambda_k(C) + t \} \le \mathbb{P}\{ \lambda_{\max}(W_+^*(\hat{C}_n - C)W_+) \ge t \} \]

\[ \le \mathbb{P}\{ \|W_+^*(\hat{C}_n - C)W_+\| \ge t \} = \mathbb{P}\{ \|W_+^* \hat{C}_n W_+ - S^2\| \ge t \}, \]

where $S$ is the square root of $W_+^* C W_+$. Now factor out $S^2$ and identify $\lambda_k(C) = \|S^2\|$ to obtain

\[ \mathbb{P}\{ \lambda_k(\hat{C}_n) \ge \lambda_k(C) + t \} \le \mathbb{P}\{ \|S^{-1} W_+^* \hat{C}_n W_+ S^{-1} - I\| \, \|S^2\| \ge t \} \]

\[ = \mathbb{P}\{ \|S^{-1} W_+^* \hat{C}_n W_+ S^{-1} - I\| \ge t/\lambda_k(C) \}. \]

Note that if $\eta$ is drawn from a $\mathcal{N}(0, C)$ distribution, then the covariance matrix of the transformed sample $S^{-1} W_+^* \eta$ is the identity:

\[ \mathbb{E}(S^{-1} W_+^* \eta \eta^* W_+ S^{-1}) = S^{-1} W_+^* C W_+ S^{-1} = I. \]

Thus $S^{-1} W_+^* \hat{C}_n W_+ S^{-1}$ is the empirical covariance matrix of a standard Gaussian vector in $\mathbb{R}^{p-k+1}$. By Theorem 1 of [ALPTJ11], it follows that $\hat\lambda_k$ is unlikely to overestimate $\lambda_k$ in relative error when the number $n$ of samples is $\Omega(p-k+1)$. A similar argument shows that $\hat\lambda_k$ is unlikely to underestimate $\lambda_k$ in relative error when $n = \Omega(\kappa_k^2 k)$.

Similarly, for more general distributions, the bounds on the probability of $\hat\lambda_k$ overestimating or underestimating $\lambda_k$ can be tightened beyond those suggested in Theorem 7.1 by using the results in [ALPTJ11] or [Ver11]. Note, however, that one cannot use knowledge of spectral decay to sharpen the results obtained from [ALPTJ11] and [Ver11] into estimates like those given in Theorem 1.1.

Finally, we note that the techniques developed in the proof of Theorem 7.1 can be used to investigate the spectrum of the error matrices $\hat{C}_n - C$.

7.3. Proofs of the supporting lemmas. We now establish the lemmas used in the proof of Theorem 7.1.


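Proof of Lemma 7.3. The probability that $\lambda_k(X)$ overestimates $\lambda_k(A)$ is controlled with the sequence of inequalities

\[ \mathbb{P}\{ \lambda_k(X) \ge \lambda_k(A) + t \} = \mathbb{P}\Big\{ \inf_{W \in \mathbb{V}^p_{p-k+1}} \lambda_{\max}(W^* X W) \ge \lambda_k(A) + t \Big\} \]

\[ \le \mathbb{P}\{ \lambda_{\max}(W_+^* X W_+) \ge \lambda_k(A) + t \}. \]

We use a related approach to study the probability that $\lambda_k(X)$ underestimates $\lambda_k(A)$. Our choice of $W_-$ implies that

\[ \mathbb{P}\{ \lambda_k(X) \le \lambda_k(A) - t \} = \mathbb{P}\Big\{ \max_{W \in \mathbb{V}^p_k} \lambda_{\min}(W^* X W) \le \lambda_k(A) - t \Big\} \]

\[ \le \mathbb{P}\{ \lambda_{\min}(W_-^* X W_-) \le \lambda_k(A) - t \} \]

\[ = \mathbb{P}\{ \lambda_{\max}(W_-^*(-X)W_-) \ge -\lambda_k(A) + t \}. \]

This establishes the bounds on the probabilities of $\lambda_k(X)$ deviating above or below $\lambda_k(A)$. □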

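Proof of Lemma 7.4. Factor the covariance matrix of $\xi$ as $G = U \Lambda U^*$ where $U$ is orthogonal and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ is the matrix of eigenvalues of $G$. Let $\gamma$ be a $\mathcal{N}(0, I_p)$ random variable. Then $\xi$ and $U \Lambda^{1/2} \gamma$ are identically distributed, so

\[ \mathbb{E}(\xi\xi^*)^m = \mathbb{E}\big[ (\xi^*\xi)^{m-1} \xi\xi^* \big] = \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1}\, U \Lambda^{1/2} \gamma\gamma^* \Lambda^{1/2} U^* \big] \]

\[ = U \Lambda^{1/2}\, \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1} \gamma\gamma^* \big]\, \Lambda^{1/2} U^*. \qquad (7.6) \]

Consider the $(i, j)$ entry of the bracketed matrix in (7.6):

\[ \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1} \gamma_i \gamma_j \big] = \mathbb{E}\Big[ \Big( \sum_{\ell=1}^p \lambda_\ell \gamma_\ell^2 \Big)^{m-1} \gamma_i \gamma_j \Big]. \qquad (7.7) \]

From this expression, the independence of the Gaussian variables $\{\gamma_i\}$, and the fact that their odd moments vanish, we see that this matrix is diagonal.

To bound the diagonal entries, use a multinomial expansion to further develop the sum in (7.7) for the $(i, i)$ entry:

\[ \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1} \gamma_i^2 \big] = \sum_{\ell_1 + \cdots + \ell_p = m-1} \binom{m-1}{\ell_1, \ldots, \ell_p} \lambda_1^{\ell_1} \cdots \lambda_p^{\ell_p}\, \mathbb{E}\big[ \gamma_1^{2\ell_1} \cdots \gamma_p^{2\ell_p} \gamma_i^2 \big]. \]

Denote the $L_r$ norm of a random variable $X$ by

\[ \|X\|_r = (\mathbb{E}|X|^r)^{1/r}. \]

Since $\ell_1, \ldots, \ell_p$ are nonnegative integers summing to $m-1$, the generalized AM-GM inequality justifies the first of the following inequalities:

\[ \mathbb{E}\, \gamma_1^{2\ell_1} \cdots \gamma_p^{2\ell_p} \gamma_i^2 \le \mathbb{E}\Big( \frac{\ell_1|\gamma_1| + \cdots + \ell_p|\gamma_p| + |\gamma_i|}{m} \Big)^{2m} = \Big\| \frac{1}{m}\Big( \sum_{j=1}^p \ell_j|\gamma_j| + |\gamma_i| \Big) \Big\|_{2m}^{2m} \]

\[ \le \Big( \frac{\ell_1\|\gamma_1\|_{2m} + \cdots + \ell_p\|\gamma_p\|_{2m} + \|\gamma_i\|_{2m}}{m} \Big)^{2m} = \|g\|_{2m}^{2m} = \mathbb{E}(g^{2m}), \]

where $g$ denotes a standard Gaussian random variable. The second inequality is the triangle inequality for $L_r$ norms. Now we reverse the multinomial expansion to see that the diagonal terms satisfy the inequality

\[ \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1} \gamma_i^2 \big] \le \sum_{\ell_1 + \cdots + \ell_p = m-1} \binom{m-1}{\ell_1, \ldots, \ell_p} \lambda_1^{\ell_1} \cdots \lambda_p^{\ell_p}\, \mathbb{E}(g^{2m}) \]

\[ = (\lambda_1 + \cdots + \lambda_p)^{m-1}\, \mathbb{E}(g^{2m}) = \operatorname{tr}(G)^{m-1}\, \mathbb{E}(g^{2m}). \qquad (7.8) \]

Estimate $\mathbb{E}(g^{2m})$ using the fact that $\Gamma(x)$ is increasing for $x \ge 3/2$:

\[ \mathbb{E}(g^{2m}) = \frac{2^m}{\sqrt{\pi}} \Gamma(m + 1/2) < \frac{2^m}{\sqrt{\pi}} \Gamma(m + 1) = \frac{2^m}{\sqrt{\pi}}\, m! \quad \text{for } m \ge 1. \]

Combine this result with (7.8) to see that

\[ \mathbb{E}\big[ (\gamma^* \Lambda \gamma)^{m-1} \gamma\gamma^* \big] \preceq \frac{2^m}{\sqrt{\pi}}\, m!\, \operatorname{tr}(G)^{m-1} \cdot I. \]

Complete the proof by using this estimate in (7.6). □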